Runtime  Speculative  Software-only 


Fault  Tolerance 


Yun  Zhang 


A  Dissertation 
Presented  to  the  Faculty 
of  Princeton  University 
in  Candidacy  for  the  Degree 
of  Doctor  of  Philosophy 


Recommended  for  Acceptance 
by  the  Department  of 
Computer  Science 


Adviser:  Professor  David  I.  August 


June  2012 


Report  Documentation  Page 

Form  Approved 

OMB  No.  0704-0188 

Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
including  suggestions  for  reducing  this  burden,  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington 

VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  a  penalty  for  failing  to  comply  with  a  collection  of  information  if  it 
does  not  display  a  currently  valid  OMB  control  number. 

1.  REPORT  DATE 

JUN  2012  2.  REPORT  TYPE 

3.  DATES  COVERED 

00-00-2012  to  00-00-2012 

4.  TITLE  AND  SUBTITLE 

Runtime  Speculative  Software-Only  Fault  Tolerance 

5a.  CONTRACT  NUMBER 

5b.  GRANT  NUMBER 

5c.  PROGRAM  ELEMENT  NUMBER 

6.  AUTHOR(S) 

5d.  PROIECT  NUMBER 

5e.  TASK  NUMBER 

5f.  WORK  UNIT  NUMBER 

7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Princeton  University, Department  of  Computer 

Science, Princeton, NJ, 08540 

8.  PERFORMING  ORGANIZATION 

REPORT  NUMBER 

9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

10.  SPONSOR/MONITOR'S  ACRONYM(S) 

11.  SPONSOR/MONITOR'S  REPORT 
NUMBER(S) 

12.  DISTRIBUTION/AVAILABILITY  STATEMENT 

Approved  for  public  release;  distribution  unlimited 

13.  SUPPLEMENTARY  NOTES 


14.  ABSTRACT 

Transient  faults  are  emerging  as  a  critical  reliability  concern  for  modern  microprocessors.  Recently, 
microprocessors  have  been  designed  with  lower  voltage  level , smaller  and  faster  transistors  enabled  by 
improved  fabrication  technology.  A  combination  of  increased  density  of  transistors  on  chip,  reduced  noise 
margin  of  each  transistor,  and  voltage  scaling  are  making  hardware  systems  more  susceptible  to  transient 
faults  than  ever.  Both  hardware  or  software  solutions  have  been  proposed  for  transient  fault  tolerance.  The 
hardware  approach  typically  adds  redundant  hardware  modules  to  the  system,  thus  requiring  extra  chip 
area  as  well  as  higher  hardware  design  and  verification  cost.  In  addition  the  scope  and  mechanism  of  fault 
tolerance  are  hardwired  at  design  time,  which  could  be  suboptimal  with  the  change  of  deployment 
environment.  Unlike  hardware  solutions  software-only  techniques  do  not  require  any  specialized  hardware 
extensions  and  are  more  flexible  with  the  scope  of  protection  and  the  change  of  environment.  However, 
even  the  best-performing  software-only  fault  tolerance  techniques  incur  significant  performance  cost.  The 
overhead  of  prior  work  comes  from  doubled  register  usage,  frequent  inter-core  communication,  or  barrier 
synchronizations.  These  factors  prevent  existing  software  techniques  from  being  adopted  widely.  To 
address  these  problems,  this  dissertation  proposes  Runtime  Software-only  Speculative  Fault  Tolerance 
(RSFT).  The  key  insights  behind  this  dissertation  are:  (1)  not  all  values  are  equally  important.  Transient 
faults  may  alter  a  transistor?s  value,  which  is  never  used.  Only  the  values  that  will  affect  the  externally 
visible  behavior  of  a  program  must  be  verified  before  being  used;  (2)  Value  speculation  can  efficiently 
remove  data  dependences  introduced  by  cross  checking  values  produced  in  the  program  and  its  redundant 
copy  with  high  confidence,  thus  significantly  improves  program  runtime  performance.  RSFT  serves  as  a 
virtual  layer  between  the  application  and  the  underlying  platform.  It  takes  a  program  binary  and 
designated  execution  arguments  as  input,  and  automatically  creates  two  symmetric  program  instances  for 
redundant  execution,  to  utilize  extra  cores  in  a  multi-core  system.  RSFT  detects  transient  faults  at  system 
calls  level  in  a  non-invasive  way  and  exploits  high-confidence  value  speculation  to  achieve  low  runtime 
overhead.  Lightweight  runtime  checkpointing  and  background  validation  work  together  to  provide 
transient  fault  recovery  with  only  6.17%  overhead.  The  prototype  of  this  framework  was  implemented  and 
evaluated  on  a  commodity  multi-core  system.  The  evaluation  demonstrated  that  with  this  framework, 
transient  fault  tolerance  can  achieve  best-in-class  performance  full  fault  coverage,  and  fast  recovery  with 
no  hardware  module  involved. 


15.  SUBJECT  TERMS 

16.  SECURITY  CLASSIFICATION  OF: 

17.  LIMITATION  OF 
ABSTRACT 

Same  as 
Report  (SAR) 

18.  NUMBER 
OF  PAGES 

103 

19a.  NAME  OF 
RESPONSIBLE  PERSON 

a.  REPORT 

unclassified 

b.  ABSTRACT 

unclassified 

c.  THIS  PAGE 

unclassified 

Standard  Form  298  (Rev.  8-98) 

Prescribed  by  ANSI  Std  Z39-18 


©  Copyright  by  Yun  Zhang,  2012. 
All  Rights  Reserved 


Abstract 


Transient  faults  are  emerging  as  a  critical  reliability  concern  for  modern  microproces¬ 
sors.  Recently,  microprocessors  have  been  designed  with  lower  voltage  level , smaller  and 
faster  transistors  enabled  by  improved  fabrication  technology.  A  combination  of  increased 
density  of  transistors  on  chip,  reduced  noise  margin  of  each  transistor,  and  voltage  scaling 
are  making  hardware  systems  more  susceptible  to  transient  faults  than  ever. 

Both  hardware  or  software  solutions  have  been  proposed  for  transient  fault  tolerance. 
The  hardware  approach  typically  adds  redundant  hardware  modules  to  the  system,  thus 
requiring  extra  chip  area  as  well  as  higher  hardware  design  and  verification  cost.  In  ad¬ 
dition,  the  scope  and  mechanism  of  fault  tolerance  are  hardwired  at  design  time,  which 
could  be  suboptimal  with  the  change  of  deployment  environment.  Unlike  hardware  solu¬ 
tions,  software-only  techniques  do  not  require  any  specialized  hardware  extensions  and  are 
more  flexible  with  the  scope  of  protection  and  the  change  of  environment.  However,  even 
the  best-performing  software-only  fault  tolerance  techniques  incur  significant  performance 
cost.  The  overhead  of  prior  work  comes  from  doubled  register  usage,  frequent  inter-core 
communication,  or  barrier  synchronizations.  These  factors  prevent  existing  software  tech¬ 
niques  from  being  adopted  widely. 

To  address  these  problems,  this  dissertation  proposes  Runtime  Software-only  Specu¬ 
lative  Fault  Tolerance  (RSFT).  The  key  insights  behind  this  dissertation  are:  (1)  not  all 
values  are  equally  important.  Transient  faults  may  alter  a  transistor’s  value,  which  is  never 
used.  Only  the  values  that  will  affect  the  externally  visible  behavior  of  a  program  must  be 
verified  before  being  used;  (2)  Value  speculation  can  efficiently  remove  data  dependences 
introduced  by  cross  checking  values  produced  in  the  program  and  its  redundant  copy  with 
high  confidence,  thus  significantly  improves  program  runtime  performance. 

RSFT  serves  as  a  virtual  layer  between  the  application  and  the  underlying  platform. 
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It  takes  a  program  binary  and  designated  execution  arguments  as  input,  and  automatically 
creates  two  symmetric  program  instances  for  redundant  execution,  to  utilize  extra  cores  in  a 
multi-core  system.  RSFT  detects  transient  faults  at  system  calls  level  in  a  non-invasive  way, 
and  exploits  high-confidence  value  speculation  to  achieve  low  runtime  overhead.  Light¬ 
weight  runtime  checkpointing  and  background  validation  work  together  to  provide  tran¬ 
sient  fault  recovery  with  only  6.17%  overhead.  The  prototype  of  this  framework  was  im¬ 
plemented  and  evaluated  on  a  commodity  multi-core  system.  The  evaluation  demonstrated 
that  with  this  framework,  transient  fault  tolerance  can  achieve  best-in-class  performance, 
full  fault  coverage,  and  fast  recovery  with  no  hardware  module  involved. 
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Chapter  1 


Introduction 


Reliability  is  one  of  the  more  critical  concerns  of  a  computer  system.  Computations  on 
a  hardware  system  are  supposed  to  be  calculating  the  correct  value,  as  fast  as  possible. 
However,  computer  systems  may  fail  due  to  hardware  errors  or  transient  faults.  Transient 
faults,  also  known  as  soft  errors,  are  caused  by  either  environmental  events,  such  as  particle 
strikes,  or  fluctuating  power  supply,  and  are  nearly  impossible  to  reproduce.  Transient 
faults  are  not  necessarily  attributed  to  design  flaws  and  occur  randomly  after  deployment. 
These  faults  do  not  cause  permanent  hardware  damage,  but  may  result  in  a  complete  system 
failure  or  data  corruption. 

In  1978,  Intel  Corporation  first  reported  transient  faults  occurrence  when  its  chip  pack¬ 
aging  modules  were  contaminated  by  uranium  from  a  mine  nearby.  Hewlett  Packard  re¬ 
ported  that  the  servers  in  Los  Alamos  National  Laboratory  were  frequently  crashing  from 
transient  faults  resulted  from  cosmic  ray  strikes  [28].  IBM  S/390  [51],  Boeing  777  air¬ 
planes  [65],  and  HP’s  Himalaya  [18]  all  incorporate  redundant  hardware  for  fault  detection 
and  recovery. 

While  transient  faults  are  already  a  concern  of  modem  computer  systems,  they  will  be¬ 
come  a  bigger  problem  in  the  future  generations  of  architectures.  As  semiconductor  tech- 
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nology  continues  to  scale,  the  number  of  transistors  on  a  single  chip  grows  exponentially. 
As  the  chip  area  remains  relatively  constant,  the  density  of  transistors  on  a  single  chip  is 
increased  significantly.  While  the  increasing  number  of  transistors  benefits  processor  per¬ 
formance,  Increasing  density  of  chips  increases  the  chance  that  a  particle  strike  affecting 
one  transistor.  Furthermore,  the  exponential  reduction  in  transistor  size  and  reduced  noise 
margin  of  each  transistor  make  them  even  less  reliable.  Moreover,  extreme  demands  for 
energy  efficiency  drive  aggressive  voltage  scaling,  which  leads  to  an  even  lower  noise  mar¬ 
gin  therefore  less  reliability.  All  of  these  technology  trends  make  processor  chips  more 
susceptible  to  transient  faults  than  ever  before. 

Due  to  the  emerging  requirement  for  reliable  systems,  transient  fault  tolerance  has  be¬ 
come  one  of  the  critical  concerns  in  the  semiconductor  industry.  A  more  recent  study  shows 
that  a  BlueGene/L  machine  with  104  nodes  deployed  in  Lawrence  Livermore  National  Labs 
experiences  soft  errors  once  every  four  hours  [9] .  For  fast  and  reliable  computation,  it  is 
critical  to  find  efficient  and  low-overhead  transient  fault  tolerance  solutions  for  modern  and 
future  architectures. 


1.1  Limitations  of  Existing  Fault  Tolerance  Techniques 

Due  to  the  nature  of  transient  faults,  the  common  practice  for  fault  tolerance  is  redundancy. 
Either  the  hardware  module  itself,  or  the  computation  on  top  of  the  hardware,  is  duplicated 
and  compared  against  each  other  to  verify  the  correctness  of  values.  There  have  been  two 
approaches,  hardware  and  software.  The  hardware  approach  either  duplicates  hardware, 
or  uses  additional  hardware  to  duplicate  software  execution,  and  the  software  approach 
duplicates  program  execution  on  the  system. 

Hardware  solutions  usually  introduce  extra  hardware  components,  typically  specialized 
for  processors  or  storage  systems.  For  example,  caches  and  memory  subsystems  include 
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extra  information  of  error-correcting  codes  (ECC)  to  allow  hardware  checking  for  transient 
faults  and  recover  at  runtime.  These  bit-level  techniques  typically  can  protect  values  in 
memory  subsystems,  such  as  caches  or  main  memory,  against  transient  faults.  But  they  are 
prohibitively  expensive  to  be  applied  to  the  processors,  due  to  the  nature  of  frequent  data 
updating  in  processors.  Previous  work  on  protecting  register  files  using  ECC  is  extremely 
costly  in  terms  of  both  performance  [57]  and  power  [38].  This  approach  cannot  be  applied 
to,  for  example  Arithmetic  Logic  Unit  (ALU),  without  paying  a  significant  penalty  in  chip 
area,  power  consumption  or  performance. 

To  solve  the  cost  problem  of  ECC  protection  for  processors,  semiconductor  industry  has 
introduced  redundancy  for  processor  cores  or  hardware  contexts  [15,  18,  30,  41,  46,  47,  51, 
61]  to  provide  transient  fault  tolerance.  Compared  with  the  ECC  or  parity  approaches,  this 
hardware  redundancy  does  not  have  to  validate  data  at  each  computation  and  update  parity 
information.  These  techniques  can  validate  less  frequently  and  reduces  extra  hardware 
required  for  error  correction.  This  is  commonly  deployed  in  reliability-critical  systems, 
especially  server  systems,  such  as  the  Compaq  NonStop  Himalya,  IBM  S390,  and  Boeing 
777. 

However,  all  hardware  approaches  involve  extra  hardware  components,  thus  adding 
higher  cost  at  design  and  validation  time.  In  addition,  the  scope  and  mechanism  of  protec¬ 
tion  are  hardwired  at  design  time  under  an  assumed  failure  model  (e.g.  single  event  upset 
model),  and  working  environment  (e.g.  reference  altitude),  which  may  be  suboptimal  de¬ 
pending  on  deployment  environments.  Some  hybrid  techniques  combine  custom  hardware 
extension  and  software  redundancy  [41,  60]  for  fault  detection.  Because  of  the  hardware 
extensions,  these  approaches  have  the  same  limitations  as  the  hardware  techniques. 

Current  architectural  trends  toward  multicore  microprocessors  naturally  provide  addi¬ 
tional  computing  resources,  thus  making  software  redundant  execution  more  viable  than 
ever.  Existing  software  proposals  [35,  44,  49,  60,  67]  typically  insert  redundant  code  into 
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a  program  at  compile  time  or  runtime,  and  check  for  transient  faults  at  runtime.  Among 
these  proposals,  compiler-based  techniques  [44,  49,  60,  67,  12]  are  only  applicable  to  pro¬ 
grams  whose  source  codes  are  available.  Separately  compiled  modules,  such  as  libraries, 
cannot  be  protected  using  compiler-based  techniques  due  to  the  absence  of  source  code  at 
compile  time.  Additionally,  these  techniques  are  not  applicable  to  legacy  binaries,  due  to 
the  absence  of  original  source  code,  or  compiler  compatibility  problems.  Runtime  tech¬ 
niques,  such  as  [49],  use  dynamic  instrumentation  to  instrument  program  binaries  for  fault 
detection  at  runtime.  But  this  approach  still  has  high  performance  overhead  due  to  the  cost 
of  dynamic  binary  instrumentation,  as  well  as  frequent  program  synchronizations. 

1.2  Research  Objectives  and  Contributions 

To  overcome  the  cost,  performance  and  applicability  limitations  of  the  previously  pro¬ 
posed  techniques,  this  dissertation  introduces  Runtime  Speculative  Software-only  Fault 
Tolerance(RSFT),  a  comprehensive  framework  that  efficiently  protect  program  execution 
against  transient  faults  without  requiring  any  specialized  hardware  or  program  source  code. 
It  works  on  off-the-shelf  hardware  platforms  and  legacy  program  binaries. 

This  dissertation  achieves  its  objective  by  taking  advantage  of  the  following  insights: 

•  Not  all  values  are  equally  important.  Some  register  values  may  never  affect  the  out¬ 
put  of  a  program  or  change  the  control  flow  of  a  program.  Applications’  inherent 
fault  tolerance  enables  RSFT  to  use  this  information  to  remove  unnecessary  value 
validation  code  to  minimize  the  runtime  validation  and  communication  cost. 

•  Cross  checking  values  produced  from  the  program  and  its  redundant  copy  creates 
data  dependences  between  these  two  instances,  and  introduces  barrier  synchroniza¬ 
tion.  Value  speculation  with  high  confidence  during  program  execution  can  effi- 


4 


ciently  remove  barrier  synchronizations  between  hardware  contexts,  and  significantly 
improve  program  performance. 

With  the  above  two  observations,  RSFT  speculates  results  of  predicatable  system  calls, 
so  that  the  original  program  does  not  have  to  synchronize  with  the  redundant  copy(ies) 
to  confirm  the  return  values.  This  allows  maximum  overlapping  of  program  execution, 
inter-thread/process  communication,  10  operations  and  transient  fault  detection. 

This  dissertation  first  presents  a  transient  fault  detection  technique  that  provides  effi¬ 
cient  fault  detection  during  program  runtime  without  requiring  program  modification  or 
recompilation.  RSFT-Detection  is  a  non-invasive  speculative  runtime  system  that  detects 
transient  faults.  RSFT-Detection  serves  as  a  light-weight  virtual  layer  between  an  appli¬ 
cation  and  the  underlying  platform.  It  takes  a  program  binary  as  input,  and  automatically 
executes  the  binary  redundantly  using  a  process  monitoring  tool  provided  by  the  operating 
system  to  trap  every  system  call.  The  arguments  of  the  system  calls  invoked  from  both  pro¬ 
gram  instances  are  compared  for  correctness.  A  value  mismatch  means  a  transient  fault  has 
occurred  and  RSFT  reports  this  to  the  user.  Unlike  some  compiler-enabled  techniques  that 
must  obtain  knowledge  of  library  functions  for  fault  detection,  RSFT  must  only  understand 
the  relatively  stable  and  well-defined  set  of  system  calls. 

To  recover  from  faulty  program  execution  and  continue  protecting  execution  from  tran¬ 
sient  faults  after  recovery,  this  dissertation  also  proposes  a  comprehensive  and  efficient 
fault  recovery  scheme,  named  RSFT-Recovery.  The  original  program  and  its  redundant 
copy  each  creates  a  checkpoint  of  themselves  at  the  beginning  of  the  program,  and  keep 
creating  checkpoints  periodically  during  execution.  The  runtime  light-weight  checkpoint¬ 
ing  is  created  using  system  call  fork’s  copy-on- write  provided  by  the  operating  system. 
One  checkpoint  includes  two  newly  forked  processes  that  are  stalled  for  future  use,  and 
their  register  files.  Once  a  checkpoint  is  created,  the  two  stalled  processes  are  probed  by 
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another  background  validation  process.  Their  register  files  as  well  as  memory  images  are 
compared  against  each  other  for  validation.  If  everything  is  identical,  this  checkpoint  is 
considered  a  valid  checkpoint,  otherwise  it  is  invalid  and  is  discarded.  At  program  run¬ 
time,  if  a  transient  fault  is  detected,  the  previous  valid  checkpoint  is  used  to  resume  correct 
program  execution  and  the  faulty  ones  are  discarded.  If  no  valid  checkpoint  is  available, 
the  program  rolls  back  to  the  very  beginning  and  restarts  execution  redundantly. 

Compare  with  hardware  or  hybrid  fault  tolerance  approaches,  RSFT  provides  compa¬ 
rable  performance.  However,  hardware  solutions  can  provide  protection  for  the  underlying 
operating  systems,  and  a  wider  range  of  hardware  modules.  For  mission-critical  systems, 
such  as  a  space  shuttle,  the  cost  associated  with  hardware  redundancy  is  not  a  major  con¬ 
cern,  but  the  speed  of  fault  recovery  and  the  fault  coverage  is  critical.  In  these  cases, 
hardware  solutions  suit  the  fault  tolerance  purpose  better  than  RSFT. 

RSFT  provides  a  cheap  and  effective  software-only  method  to  protect  applications  from 
transient  faults,  with  full  fault  coverage.  However,  redundant  execution  costs  extra  re¬ 
sources  to  achieve  the  fault  tolerance.  For  example,  RSFT  consumes  twice  as  much  mem¬ 
ory  and  CPU  cycles  as  the  original  unprotected  program.  RSFT  performs  best  when  applied 
to  applications  that  are  computation  intensive,  and  on  multi-core  or  distributed  architecture 
where  extra  unused  CPU  resource  is  available.  Scientific  computation  is  a  good  example 
of  a  typical  application  that  RSFT  works  best  on.  For  heavy  10  bound  applications,  RSFT 
may  add  considerable  runtime  overhead.  In  addition,  as  RSFT  does  not  provide  protection 
for  execution  out  of  an  application’s  user  space,  RSFT  has  wider  window  of  vulnerability 
for  those  applications.  These  programs  are  better  protected  using  hardware  solutions  than 
software  ones. 

One  type  of  10  bound  application  is  the  video/audio  decoding  applications,  such  as 
media  players.  These  application  have  high  tolerance  to  faults  naturally,  and  require  high 
processing  throughput.  The  high  fault  coverage  of  RSFT  is  not  necessary,  and  RSFT’s 
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runtime  misspeculation  cost  is  not  appealing  for  these  applications.  Compared  with  RSFT, 
other  software  solutions,  such  as  symptom-based  techniques,  will  introduce  lower  over¬ 
head,  and  provide  acceptable  level  of  fault  tolerance. 

A  prototype  of  RSFT  is  implemented  and  evaluated  in  this  dissertation.  To  study  the 
reliability  of  RSFT,  a  in-depth  analysis  on  its  window  of  vulnerability  is  also  discussed  and 
measured  via  simulated  fault  injection.  The  performance  of  RSFT  with  fault  detection  and 
recovery  are  separately  evaluated  on  a  commodity  hardware  system. 

It  must  be  noted  that  the  implementation  of  RSFT  in  this  thesis  does  not  support 
multi-threaded  programs.  However,  this  implementation  can  be  extended  to  support  multi¬ 
threaded  programs  with  deterministic  outputs.  For  applications,  where  non-deterministic 
outputs  are  acceptable,  RSFT  may  raise  false  alarms.  Research  on  applying  RSFT  to  these 
applications  and  redefining  faulty  behavior  for  programs  with  non-deterministic  outputs 
remains  a  future  work  of  this  thesis. 

The  contributions  of  this  thesis  are: 

•  Design  and  implementation  of  RSFT,  the  fastest  transient  fault  tolerance  technique 
known  to  date.  RSFT  delivers  full  transient  fault  coverage  with  a  geomean  perfor¬ 
mance  overhead  of  6.17%  for  23  SPEC  CPU  benchmarks  using  a  commodity  multi¬ 
core  system. 

•  A  detailed  evaluation  of  the  reliability  of  RSFT  by  injecting  transient  faults  into  both 
register  files  and  memory  space.  This  work  is  the  first  to  simulate  and  evaluate  the 
reliability  of  a  software  transient  fault  tolerance  technique  against  memory  faults. 

•  A  transient  fault  tolerance  framework  that  does  not  require  any  program  modification 
and  recompilation. 
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1.3  Dissertation  Organization 


Chapter  2  provides  background  information  on  transient  faults,  the  state-of-the-art  tran¬ 
sient  fault  tolerance  techniques  and  their  limitations.  Chapter  3  and  Chapter  4  propose 
software-only  speculative  transient  fault  detection  and  runtime  checkpointing  and  recov¬ 
ery  techniques,  respectively.  Their  design  and  implementation  details  are  described  in 
these  chapters  as  well.  Chapter  5  introduces  the  methods  used  for  evaluation  in  this  thesis, 
comparing  with  common  practice  in  the  field  and  provides  rationale  behind  these  metrics. 
Chapter  6  summaries  a  broad  quantitative  evaluation  of  RSFT  on  a  commodity  multi-core 
system.  Finally,  Chapter  8  concludes  this  dissertation  and  discusses  future  directions  of 
research. 
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Chapter  2 


Background 


Transient  faults  are  emerging  as  a  critical  reliability  concern  in  microprocessors.  These 
faults  are  usually  caused  by  external  events  such  as  particle  strikes,  or  internal  events, 
such  as  power  fluctuation  [4,  34,  43,  48].  These  events  can  cause  additional  charge  to  be 
deposited  and  therefore  alter  the  value  of  a  single  transistor.  Transient  faults  do  not  result  in 
permanent  hardware  damage,  but  may  lead  to  system  failures  or  affect  program  execution. 

2.1  Transient  Faults  Problem 

Transient  faults  have  already  caused  significant  failures  in  deployed  computer  systems.  Sun 
Microsystems,  for  one,  acknowledges  that  customers  such  as  America  Online  (AOL),  eBay, 
and  Los  Alamos  National  Labs  have  experienced  system  failures  due  to  transient  faults  [5]. 
In  2005,  a  2048-CPU  Hewlett  Packard  server  system  in  Los  Alamos  National  Laboratory 
was  frequently  crashing  because  of  transient  faults  caused  by  cosmic  ray  strikes  [28].  In 
addition  to  causing  machine  crashes,  recent  research  has  also  shown  that  soft  errors  can 
lead  to  the  derivation  of  secret  keys  in  RSA  public -key  cryptography  [3,  7]  and  induce 
security  vulnerabilities  [16]. 
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Intel  [54]  and  IBM  [68]  noted  that  radioactive  materials  can  change  the  values  in  storage 
devices  of  processors.  Manufacturing  process  has  been  taking  caution  to  protect  hardware 
from  being  exposed  to  radiation.  While  the  fault  rate  per  bit  remains  relatively  constant 
over  technology  generations  [17],  exponentially  growing  transistor  counts,  combined  with 
aggressive  voltage  scaling,  make  microprocessors  more  susceptible  to  transient  faults  than 
ever  before.  Additionally,  cosmic  radiation  effects  on  modem  microprocessors  are  hard  to 
trace  and  evaluate  because  it  changes  with  the  system’s  deployment  environment,  such  as 
altitude,  geography  and  periodic  solar  phases.  A  majority  of  solar  radiation  is  deflected 
from  the  planet  surface  by  the  Earth’s  magnetic  field.  The  amount  of  radiation  can  vary 
significantly  depending  on  the  geographic  location. 

It  is  also  difficult  to  measure  the  failure  rate  after  deployment  due  to  the  unpredicatablity 
of  transient  faults.  Previous  research  has  shown  that  average  failure  rate  per  bit  is  between 
0.01  to  0.001  per  109  hours[17,  22,  32,  56].  SPARC64  was  reported  to  have  80%  of  its 
200,000  latches  covered  by  some  form  of  fault  protection  [2]. 

Future  generations  of  processors  will  be  even  more  susceptible  to  transient  faults.  Com¬ 
peting  factors  for  the  reliability  of  individual  bits  in  a  processors  suggest  the  failure  rate  per 
bit  will  remain  constant  for  the  next  few  generations  [17,  22].  The  reduction  in  transistor 
size  decreases  the  probability  that  cosmic  radiation  will  strike  one  single  transistor,  but 
each  transistor  also  has  a  reduced  critical  voltage  (the  charge  necessary  to  change  its  val¬ 
ues),  increasing  the  likelihood  that  a  particle  strike  will  affect  the  stored  state.  While  the 
failure  rate  per  bit  is  staying  roughly  constant,  the  failure  rate  per  processor  is  increasing 
proportional  to  the  number  of  transistors  on  a  chip,  which  increases  at  an  exponential  rate. 

Voltage  scaling  is  also  one  factor  that  may  increase  the  fault  rate  per  transistor  in  the 
future.  Active  power  consumption1  of  a  chip  has  quadratic  relationship  with  the  supply 
voltage.  As  a  result,  the  tightening  power  consumption  constraint  requires  the  aggressive 
'Active  power  consumption  is  the  total  power  consumed  minus  the  leakage  power 
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voltage  scaling.  Figure  2.1  shows  the  voltage  scaling  trend  over  generations  of  micro¬ 
processors  [8].  This  technology  trend  leads  to  reduced  noise  margins,  which  makes  tran¬ 
sistors  more  vulnerable  against  particle  strikes  and  power  fluctuation. 

The  soft  error  rate  per  chip,  including  processor  logic  and  on-chip  memory  subsystems, 
is  increasing  with  the  decrease  of  technology  node.  As  shown  in  Figure  2.2,  processors 
produced  with  the  32nm  technology  is  50%  more  vulnerable  than  the  ones  made  with 
180nm  [8].  As  technology  leans  toward  smalled  logic  units,  the  soft  error  rate  will  continue 
to  scale  up.  In  2004,  Borkar  et  al.  estimates  about  8%  reliability  degradation  per  bit  per 
generation  of  microprocessors.  It  is  getting  more  and  more  difficult  to  design  and  produce 
computer  systems  with  reliable  components  today,  and  even  so  in  the  future. 
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Figure  2.2:  Technology  trend  and  soft  error  failures  in  time.  *Data  source:  Intel 


2.2  Existing  Methods  for  Transient  Fault  Tolerance 


Fault  tolerance  solutions  have  been  proposed  to  detect  transient  faults  during  program  exe¬ 


cution  via  redundant  computation,  and  recovery  from  the  faults.  The  sphere  of  replication 


(SoR)  [41]  is  used  to  identify  the  scope  of  fault  coverage.  Values  that  enter  the  SoR  must 


be  replicated  for  redundancy  and  values  that  exit  the  SoR  must  be  checked  for  faults  to 


ensure  their  correctness.  The  implication  of  SoR  is  the  scope  of  transient  fault  protection. 


All  fault  tolerance  techniques  proposed  so  far  depends  on  redundancy,  either  in  space  or 


time.  Table  2.1  lists  most  recent  existing  representative  fault  detection  techniques. 


2.2.1  Hardware  Redundancy 


Hardware  redundancy  provides  transparent  fault  tolerance  using  extra  specialized  hard¬ 


ware,  such  as  watchdog  processor  in  [26].  In  reality,  IBM  S/390  [51],  Boeing  111  air- 


Soft  Error  Failure  in  Time  Rate  per  Chip 
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Approach 

Technique 

Main 

Memory 

Usage 

Need 

Source 

Code 

Hardware 

Contexts 

Sphere  of 
Replication 

Reported 

Overhead 

Detection 

Recovery 

Processor 

Memory 

Detection 

Recovery 

Specialized 

Hardware 

SWAT  [25] 

lx 

No 

2 

2 

Most 

None 

5% 

AR-SMT  [46] 

lx 

No 

8 

- 

All 

None 

16.7%+ 

- 

CRT  [30] 

lx 

No 

2 

- 

Most 

None 

Unreported 

- 

SRT  [41] 

lx 

No 

2 

- 

Most 

None 

Unreported 

- 

Hybrid-SRMT  [60] 

lx 

Yes 

2 

- 

Most 

None 

19%+ 

- 

Thread-local 

Duplication 

EDDI  [35] 

2x 

Yes 

1 

- 

Most 

All 

52.2% 

- 

SWIFT  [44] 

lx 

Yes 

1 

- 

Most 

None 

45%+ 

- 

Shoestring  [12] 

lx 

Yes 

1 

- 

Most 

None 

118%+ 

- 

Redundant 

Multi-Threading 

SRMT  [60] 

lx 

Yes 

2 

- 

Most 

None 

400% 

- 

DAFT  [67] 

lx 

Yes 

2 

- 

Most 

None 

38% 

- 

Process-based 

Redundancy 

PLR  [49] 

2x 

No 

2 

3 

Most 

Some 

16.9% 

41% 

RSFT[This  Thesis] 

2x 

No 

2 

2 

Most 

All 

3.54% 

6.17% 

Table  2.1:  Comparison  Among  Transient  Fault  Tolerance  Techniques,  t  indicates  the  results 
was  obtained  on  a  simulator,  not  commodity  hardware. 


planes  [65],  and  HP’s  Himalaya  [18]  all  incorporate  triple-redundant  hardware  for  fault 
detection  and  recovery  on-the-fly. 

However,  redundant  execution  in  custom  hardware  can  increase  the  transistor  count  of 
a  processor  by  20-30%.  This  also  leads  to  extra  chip  area  and  verification  cost  [2,  51].  For 
example,  AR-SMT  [46],  SRT  [41],  and  CRT  [30]  all  use  at  least  one  or  more  processing  unit 
for  transient  fault  detection  only.  Additionally,  the  scope  and  mechanism  of  protection  are 
hardwired  at  design  time  under  an  assumed  failure  model  (e.g.  single  event  upset  model), 
and  working  environment  (e.g.  reference  altitude),  which  may  be  suboptimal  depending  on 
deployment  environments. 

One  hardware  solution,  named  SWAT  [25],  is  a  symptom-based  fault  tolerance  tech¬ 
nique.  It  detects  transient  faults  by  observing  the  program  behavior  at  runtime  using  an 
assumed-to-be  faulty-free  processor.  If  a  program  crashes,  SWAT  will  report  it  and  recover 
from  a  previous  checkpoint,  which  is  also  assumed  to  remain  intact  after  the  checkpoint  is 
made.  This  approach  also  does  not  handle  Byzantine  failures  2,  which  is  more  difficult  to 
detect  than  program  crash. 

Early  multi-threaded  fault  detection  techniques  rely  on  specialized  hardware  to  execute 

2Byzantine  failures  means  program  runs  and  completes  even  it  is  affected  by  faults.  Programs  with  this 
kind  of  failure  does  not  crash,  but  will  produce  erroneous,  random  or  even  malicious  results. 
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redundant  copies  of  the  program  for  transient  fault  detection  and  recovery.  Rotenberg’s 
AR-SMT  [46]  is  the  first  technique  to  use  simultaneous  multi-threading  for  transient  fault 
detection.  An  active  thread  (A)  and  a  redundant  thread  (R)  execute  the  same  program  at 
runtime,  and  their  computation  results  are  compared  to  detect  transient  faults.  This  method 
uses  an  8-way  simultaneous  multi-threading  trace  processor  to  achieve  its  purpose.  Simul¬ 
taneous  Redundant  Threading  (SRT)  [41]  and  Chip-level  Redundant  Threading  (CRT)  [30] 
exploit  simultaneous  multi-threaded  processors  and  multiple  cores  respectively  for  redun¬ 
dant  execution  and  value  checking.  These  techniques  use  duplicate  hardware  modules, 
and  check  values  when  they  escape  the  SoR  for  fault  detection.  Limited  by  its  specialized 
hardware  requirement,  these  approaches  are  also  not  widely  adopted. 

For  caches  and  memory  subsystems,  extra  information  of  error-correcting  codes  (ECC) 
is  a  common  practice  to  allow  hardware  checking  for  transient  faults  and  recover  at  run¬ 
time.  This  bit-level  technique  typically  protects  values  in  memory  subsystems,  such  as 
caches  or  main  memory,  by  encoding  the  bits  and  verifies  the  value  against  transient  faults. 
Although  this  is  effective  to  tolerant  single -bit-flip  transient  faults,  they  do  not  work  well 
against  multiple  bit  flip  events.  Additionally,  they  are  prohibitively  expensive  to  be  ap¬ 
plied  to  the  processors,  or  on-chip  caches,  due  to  the  nature  of  frequent  data  updating  in 
processors.  Previous  work  on  protecting  register  files  using  ECC  is  extremely  costly  in 
terms  of  both  performance  [57]  and  power  consumption  [38].  Similar  results  have  been 
found  for  on-chip  cache  ECC  protection  as  well.  Systems  with  ECC  protected  caches  suf¬ 
fers  nearly  2%  performance  degradation,  and  15%  chip  area  increase.  With  the  increase 
of  on-chip  cache  size  and  exponentially  increasing  transistor  count  of  processors,  protect¬ 
ing  processors  and  on-chip  caches  means  paying  a  significant  penalty  in  chip  area,  power 
consumption,  performance,  or  all  of  the  above. 
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2.2.2  Software  Redundancy 

Compared  with  hardware  redundancy  approaches,  software-only  solutions  are  more  ap¬ 
pealing  for  its  cost-efficiency  and  flexibility  [15,  30,  44,  60,  35].  Although  software  solu¬ 
tions  also  use  some  extra  resources  for  redundancy,  they  usually  do  not  require  specialized 
hardware  modules.  Instead,  these  techniques  exploit  redundancy  using  existing  under¬ 
employed  processors,  register  files  or  cores  to  achieve  redundancy.  Software  transient  fault 
detection  techniques  typically  fall  into  three  categories:  thread-local  duplication,  redundant 
multi-threading  and  process-based  redundancy,  as  shown  in  Table  2.1.  Thread-local  dupli¬ 
cation  techniques  such  as  EDDI  [35]  and  SWIFT  [44]  redundantly  execute  instructions 
within  a  single  thread,  exploiting  instruction-level  parallelism  to  improve  performance. 
Shoestring  [12]  combines  symptom-based  fault  detection  with  selective  instruction  dupli¬ 
cation  to  achieve  lower  overhead  than  both  EDDI  and  SWIFT,  but  with  lower  fault  cover¬ 
age.  Redundant  multi-threading  techniques  (e.g.  SRMT  [60]  and  DAFT  [67])  use  multiple 
threads  to  execute  program  codes  redundantly.  Process-based  redundant  techniques  (e.g. 
PER  [49])  use  multiple  processes  instead  of  threads,  at  the  cost  of  maintaining  multiple 
memory  states. 

All  these  techniques  are  typically  implemented  using  either  compiler  transformations 
or  runtime  systems.  Compiler-based  approaches  toward  transient  fault  detection,  such  as 
EDDI  [35],  SWIFT  [44],  SRMT  [60],  DAFT  [67],  and  Shoestring  [12],  all  require  program 
source  code  for  recompilation,  and  cannot  detect  any  transient  fault  occurring  in  separately- 
compiled  modules. 

EDDI  [35]  provides  fault  tolerance  for  programs  via  instruction  duplication  within  the 
same  thread.  Each  memory  location  has  a  corresponding  shadow  location  in  memory  for 
duplicated  instruction,  which  increases  memory  pressure.  SWIFT  [44]  also  exploits  unused 
computing  power  of  multiple-issue  processors  by  duplicating  program  execution  within 
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A:  for  (iter  =  1;  iter  <=  timesteps;  iter++)  { 

B:  computeMatrix (  )  ;  int  printf  (const  char  *format,  ...)  { 

C:  printf ("%d:  %.2e  %.2e  %.2e\n" ,  D: 

Node[src],  Matrix [pos ][ 0 ] ,  syscall ( sys_write ) ; 

Matrix [pos ][ 1 ] ,  Matrix [ pos ][ 2 ]) ;  E: 

F:  updateValues ( ) ;  } 

} 

Figure  2.3:  Simplified  Code  Example  from  SPECINT  2000  Benchmark  183.equake 

the  same  thread.  Achieving  low  runtime  overhead  for  fault  detection,  SWIFT-transformed 
codes  require  twice  as  many  registers,  potentially  causing  register  spills.  In  SWIFT,  mem¬ 
ory  operations  can  only  be  performed  once  due  to  potential  side  effects.  Transient  fault 
checking  instructions  must  be  inserted  before  every  memory  operation.  Unlike  SWIFT, 
RSFT  exploits  multiple  cores  for  redundant  execution  to  minimize  runtime  overhead,  and 
utilizes  separate  memory  spaces  provided  by  multiple  processes  to  eliminate  per-memory 
operation  fault  checking. 

Software-based  Redundant  Multi-threading  (SRMT)  is  a  software  solution  that  achieves 
redundancy  with  multiple  threads.  The  SRMT  techniques  uses  compiler  transformation  to 
automatically  generate  redundant  code  for  runtime  fault  detection.  However,  due  to  the 
single  memory  state  maintained  during  execution,  redundancy  is  lost  at  memory  opera¬ 
tions.  These  techniques  cannot  issue  redundant  store  instructions  because  only  one  shared 
memory  state  is  maintained.  Before  a  memory  operation  is  executed,  its  operands  are  com¬ 
municated  between  threads  and  checked  for  consistency.  Consequently,  frequent  barrier 
synchronization  is  required  and  adds  significant  performance  cost.  When  a  real  transient 
fault  triggers  an  exception,  SRMT  invokes  the  program’s  exception  handler  to  catch  the 
fault,  registering  a  false  positive  and  possibly  changing  the  program’s  behavior. 

Compared  with  compiler-based  techniques,  runtime  techniques  do  not  require  any  pro¬ 
gram  source  code  to  be  recompiled  and  can  detect  transient  faults  for  separately  compiled 
modules  as  well.  One  such  implementation  called  PER  [49],  a  dynamic  instrumentation 
technique,  provides  transient  fault  detection  with  the  minimum  runtime  overhead  (16.9%) 
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Figure  2.4:  Execution  plan  of  transient  fault  detection  without  and  with  speculation  for 
example  program  in  Figure  2.3  with  timesteps  =  3.  The  execution  time  of  same  in¬ 
struction  blocks,  such  as  Bl,  B2,  are  slightly  different  across  both  processes  because  of 
various  runtime  factors  such  as  cache  behavior  and  process  scheduling. 
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among  all  software  solutions  with  full  coverage.  This  technique  duplicates  the  original 
program  into  several  instances  at  runtime,  maintaining  one  private  memory  space  for  each 
instance.  Only  externally  visible  values  need  to  be  verified  before  they  escape  user  space. 

PLR  duplicates  the  execution  of  the  application  and  its  libraries,  protecting  program 
execution  on  both  processors  and  memory,  but  not  in  the  operating  system.  As  a  result,  the 
arguments  of  a  system  call  escape  the  SoR  and  must  be  checked  before  the  system  call  is 
executed  in  kernel  mode.  Similarly,  the  return  value  of  the  system  call  is  an  input  to  the 
SoR,  and  needs  to  be  replicated  in  both  program  instances.  Therefore,  PLR  synchronizes 
the  main  and  the  redundant  processes  at  every  system  call  for  transient  fault  detection  and 
returns  replicated  value.  The  main  process  executes  the  system  call.  The  redundant  process 
only  resumes  execution  after  the  system  call  is  completed.  This  barrier  synchronization 
puts  inter-core  communication  on  the  critical  path  of  program  execution,  leading  to  slower 
performance. 

Other  techniques,  such  as  the  pi  bit  [64]  proposed  by  Weaver  et  al.  and  dependence- 
based  checking  [58]  by  Vijaykumar  et  al.  detect  faults  that  affect  program  behavior.  This 
is  done  by  following  the  propagation  of  faults  through  the  entire  program.  For  optimal 
performance,  these  techniques  have  to  use  herotic  alias  analysis  to  find  the  minimum  set  of 
dependences  that  fault  tolerance  code  must  track  at  runtime.  RSFT  achieves  the  same  goal 
by  only  detecting  non-benign  faults  only  when  they  are  about  to  affect  program  output.  No 
program  source  code  or  alias  analysis  is  needed  to  analyze  the  fault  propagation  ahead  of 
time. 


2.3  Limitations  of  Existing  Techniques 

Hardware  redundant  computing  requires  extra  chip  area,  extra  logic  units,  and  additional 
hardware  verification.  The  scope  of  protection  and  fault  detection  scheme  are  usually  hard- 
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wired  at  design  time,  which  limits  the  system’s  flexibility.  On  the  other  hand,  software 
redundancy  is  more  flexible  and  much  cheaper  in  terms  of  physical  resources.  Software 
approach  avoids  expensive  hardware  and  chip  development  costs.  The  ongoing  multicore 
design  provides  increasing  parallel  resources  in  hardware,  making  software  redundancy 
solutions  more  viable  than  ever. 

Although  software  approach  is  more  appealing,  the  significant  runtime  overhead  of  ex¬ 
isting  software  techniques  prohibits  the  previously  proposed  schemes  from  being  deployed 
on  commodity  multi-core  systems.  For  example,  SRMT  [60],  a  compiler- assisted  redun¬ 
dant  multi-threading  implementation,  adds  19%  performance  overhead,  even  with  special¬ 
ized  hardware  communication  queue.  PLR  [49],  as  an  example  of  runtime  technique  using 
process  level  redundant  execution,  adds  16.9%  average  runtime  overhead  for  un-optimized 
programs,  and  much  more  overhead  on  10  intensive  applications. 

This  dissertation  proposes  RSFT,  the  first  speculative  transient  fault  tolerance  frame¬ 
work  that  provides  the  fastest  transient  fault  detection  and  recovery.  RSFT  automatically 
duplicates  program  execution  in  a  non-invasive  way.  As  a  result,  RSFT  maintains  repli¬ 
cated  memory  states  and  has  the  SoR  that  covers  both  the  processor  cores  and  the  memory 
subsystems.  It  eliminates  frequent  barrier  synchronization  via  speculation,  and  provides 
efficient  recovery  mechanisms  after  a  transient  fault  is  detected.  Compared  with  previous 
work,  RSFT  yields  the  lowest  performance  overhead  for  transient  fault  detection  without 
compromising  fault  coverage. 

To  illustrate  the  advantage  of  RSFT  ,  Figure  2.3  shows  a  simplified  code  example  from 
183.equake,  a  SPECFP  2000  benchmark  program.  Figure  2.4  compares  non-speculative 
execution  plan  versus  speculative  execution  plan  in  RSFT.  These  execution  plans  demon¬ 
strate  that  barrier  synchronizations  add  considerable  runtime  overhead  to  the  program.  Al¬ 
though  both  process  instances  are  executing  the  same  program  binary,  the  cycles  spent  on 
executing  each  piece  of  code,  such  as  B 1 ,  B2 ,  .  .  . ,  are  not  the  same  because  of  various 
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runtime  factors  such  as  cache  behavior  and  process  scheduling.  Forcing  barrier  synchro¬ 
nization  at  every  system  call  accumulates  such  timing  difference,  consequently  slows  down 
the  whole  program  execution.  In  contrast,  RSFT  allows  one  process  to  speculate  the  return 
values  of  sys  write  system  call  without  actually  executing  it,  therefore  does  not  require 
waiting  till  the  other  process  to  invoke  the  same  system  call  and  the  barrier  synchronization 
between  several  program  instances.  If  a  misspeculation  occurs,  an  efficient  misspeculation 
recovery  scheme  is  employed  to  continue  execution  from  a  previous  verified  program  state. 
Combining  all  the  features  above,  RSFT  achieves  very  low  runtime  overhead  with  full  fault 
coverage. 

To  understand  and  compare  the  performance  of  PLR,  the  best  performing  software 
transient  fault  tolerance  technique,  this  thesis  also  implemented  an  approximation  of  PLR 
called  RSFT-NoSpec.  The  implementation  of  RSFT-NoSpec  faithful  replicates  the  original 
paper,  but  using  the  light-weight  processing  monitoring  tool  utilized  in  RSFT  ,  instead 
of  PIN,  the  dynamic  binary  instrumentation  employed  in  the  original  paper.  As  shown 
in  Figure  2.1,  RSFT-NoSpecis  a  lower-bound  approximation  of  PLR  with  regard  to  the 
performance  overhead. 
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Chapter  3 


Runtime  Speculative  Transient  Fault 
Detection 


This  chapter  introduces  the  transient  fault  detection  mechanism  in  RSFT  ,  called  RSFT- 
Detect  [66].  Compared  with  previously  proposed  fault  detection  work,  this  technique 
does  not  require  program  source  code,  runtime  instrumentation  or  program  recompilation. 
RSFT-Detect  is  a  fast  and  effective  transient  fault  detection  system  that  exploits  the  effi¬ 
ciency  and  functionality  of  an  OS-level  process  monitoring  tool  ptrace.  The  ptrace 
utility  is  a  POSIX  standard  that  is  provided  by  Linux/Unix,  Mac  OS,  and  Solaris  systems 
to  provide  process  monitoring  and  debugging  capabilities.  This  kind  of  monitoring  utility 
is  exploited  in  RSFT-Detect  as  a  method  of  trapping  system  calls  and  comparing  the  values 
of  their  arguments  to  detect  transient  faults.  This  approach  ensures  that  RSFT-Detect  can 
transparently  detect  transient  faults  occurring  in  a  non-invasive  way. 

This  chapter  first  discusses  the  existing  research  on  transient  fault  detection,  and  com¬ 
pares  RSFT-Detect  with  other  alternative  techniques.  Section  3.2  describes  the  overall 
structure  of  RSFT-Detect .  The  following  sections  details  the  technique  and  its  implemen¬ 
tation.  Section  3.11  discusses  the  window  of  vulnerability  of  RSFT-Detect,  and  compares 
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it  with  previous  works. 


3.1  Related  Work 

Software-only  transient  fault  detection  techniques  are  advantageous  compared  with  hard¬ 
ware  approaches  because  of  their  flexibility  and  no  extra  hardware  design  cost.  They  are 
typically  cheaper  and  easier  to  deploy  on  a  commodity  systems. 

Among  existing  software  transient  fault  detection  techniques,  many  proposals  use  the 
compiler  to  recompile  application  code  and  insert  proper  instructions  for  transient  fault 
detection.  While  these  techniques  can  obtain  a  global  view  of  the  program  code  and  in¬ 
struction  dependences  to  further  optimize  code  for  performance,  they  are  limited  by  the 
need  for  the  program  source  code.  Separately  compiled  modules,  such  as  existing  libraries, 
often  do  not  have  source  code  available  for  recompilation.  Thus,  these  separately  compiled 
modules  are  a  part  of  the  window  of  vulnerability  of  these  techniques.  Previous  compiler- 
based  fault  tolerance  techniques  do  not  evaluate  their  fault  coverage  when  processors  are 
running  library  code. 

Another  approach,  being  aware  of  such  limitations,  proposes  to  instrument  program 
binary  code  at  runtime  for  transient  fault  tolerance  [49].  Although  this  approach  protects 
program  and  its  dynamically  linked  binaries,  it  suffers  from  high  overhead  incurred  by  the 
dynamic  instrumentation.  For  example,  Pin  instrumentation  framework,  which  was  used  in 
PLR  [49],  is  reported  to  add  20-30%  runtime  overhead  on  average,  just  through  executing 
a  program  binary  within  the  framework.  The  evaluation  in  this  dissertation  runs  all  SPEC 
benchmark  programs  within  the  PIN  dynamic  instrumentation  framework,  and  proves  the 
estimates.  The  average  performance  overhead  of  PIN  is  44.54%  across  all  benchmark 
programs.  The  barrier  synchronization  in  PLR  further  adds  17.27%  more  overhead  on  top 
of  that. 
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3.2  Overview 


Unlike  all  previous  proposals,  RSFT-Detect  utilizes  a  process  monitoring  utility  (pt  race) 
to  intercept  system  calls  invoked  by  a  program.  As  a  light-weight  interface  between  the 
operating  system  and  user  applications,  ptrace  adds  very  little  runtime  overhead.  Fig¬ 
ure  3.1  demonstrates  the  overall  structure  of  RSFT-Detect  and  the  interaction  between  sev¬ 
eral  components  of  the  system.  RSFT-Detect  first  takes  the  program  binary  and  its  input, 
then  spawns  one  process  that  execute  the  binary  redundantly.  Upon  process  creation,  RSFT- 
Detect  immediately  pauses  that  process,  and  inject  a  fork  system  call  into  the  just-created 
child  process  App.  The  child  process  App  spawns  another  process  App'  from  itself,  in¬ 
heriting  ah  its  virtual  address  table  and  signal  handling  table.  This  is  critical  to  eliminate 
potential  false-positives,  especially  on  systems  with  address  space  randomization  enabled. 
From  then  on,  App  gives  up  its  parent-ship  of  App'  to  the  tracer  process.  Both  App  and 
App '  becomes  processes  that  are  traced  only  by  the  tracer  process  RSFT-Detect. 

During  the  execution,  RSFT-Detect  serves  as  a  virtual  layer  between  the  application 
and  the  underlying  OS  services  and  devices.  It  traps  every  system  call  invoked  by  either 
program  instance.  After  a  system  call  is  trapped,  the  process’s  register  hie  is  examined  to 
find  out  the  type  of  the  system  call.  RSFT-Detect  then  compares  the  system  call’s  argu¬ 
ments  against  those  in  the  other  program  instance  to  check  for  transient  faults,  if  available, 
according  to  its  specific  calling  context.  If  the  system  call  reads  a  process’  memory  space 
through  pointer  arguments,  the  memory  content  is  also  checked.  If  no  transient  fault  is 
detected,  RSFT-Detect  executes  the  system  call  and  lets  the  program  instance  continue  ex¬ 
ecution.  This  runtime  system  predicts  the  results  of  system  calls  when  possible,  which 
allows  speculative  execution  of  the  program.  When  misspeculation  is  detected  after  a  sys¬ 
tem  call  is  completed,  misspeculation  recovery  schemes  are  employed  to  ensure  continuous 
correct  execution  of  both  program  instances. 
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Figure  3.1:  RSFT-Detect  Structural  Overview 

3.3  Automatic  Process  Duplication 


As  shown  in  Figure  3.1,  RSFT-Detect  first  fork  a  new  process  from  itself,  and  starts  exe¬ 
cuting  the  binary  with  its  arguments  in  the  newly  create  process.  Algorithm  1  describes  the 
detailed  steps  of  how  to  force  a  child  process  W1  to  create  a  copy  of  itself  and  gives  up  its 
parent- ship  to  RSFT-Detect. 


Algorithm  1  Non-invasive  Process  Duplication 
1:  W 1  .pid  =  fork() 

2:  copy  original  register  file  of  W1 
3:  Wl. Regs. rax  =  SYS  fork 

4:  //  the  instruction  SYSCALL  is  OF  0  5,  which  takes  2  bytes 
5:  Wl. Regs. rip  =  Wl. Regs,  rip-2 
6:  continue  executing  the  system  call 

7:  //  program  will  re-enter  interrupt  and  execute  system  call  fork 
8:  wait  for  kernel  finishes  executing  SYS. fork 
9:  W2.pid  =  Wl. Regs. rax 
10:  copy  back  original  register  file  of  W2 
11:  continue  executing  Wl  and  W2 
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Process  Stack  Kernel  Stack  Process  Stack  Tracer  Kernel  Stack 


(a)  without  system  call  intercepting  (b)  system  call  intercepting 


Figure  3.2:  System  Call  Intercepting  Mechanism 

3.4  System  Call  Trapping 

RSFT-Detect  detects  transient  faults  at  system  call  interface  level.  The  idea  is  to  intercept 
all  system  calls  initialized  from  two  identical  program  instances.  There  is  a  fixed  set  of 
well-defined  system  calls  provided  by  the  operating  system  to  applications.  For  modern 
Linux  operating  system,  a  total  of  298  system  calls  are  defined.  It  is  reasonable  to  specify 
customized  system  call  arguments  verification  for  each  system  call. 

RSFT-Detect  serves  as  a  thin  layer  between  programs  and  the  underlying  system  and 
hardware  devices.  Figure  3.2  illustrates  the  process  of  system  call  intercepting  and  ar¬ 
gument  verification  process.  Depending  on  the  type  of  the  arguments,  different  checking 
methods  are  used,  including  value  comparison  and  memory  content  comparison.  As  the 
interface  of  system  calls  are  very  well  defined,  all  system  call  arguments  must  be  a  register 
value.  If  an  argument  is  a  pointer,  the  size  of  the  pointed  to  content  must  be  specified  in 
the  system  call  arguments  as  well.  Operating  system  itself  also  leverages  this  knowledge 
to  handle  the  user  space  memory  content.  As  a  result,  it  is  possible  to  manually  inspect  all 
system  calls,  and  write  speculation  code  for  each  system  call  by  hand. 

Despite  the  redundant  execution  of  program  binaries,  side-effecting  work  should  only 
happen  once.  Since  system  calls  have  well  defined  interfaces  and  calling  contexts,  it  is 
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Duplicate 

Speculation 

Example 

Distribution  in  SPEC 

Sync 

No 

No 

sys_read(0),  sys_readv(0),  sys  Jseek(O) 

0.01% 

Asynch 

Yes 

No 

sys  mmap,  sys_mremap,  sys  brk 

19.33% 

No 

Yes 

sysjwrite,  sysjwritev,  sys  _seti timer 

80.66% 

Table  3.1:  Categorized  System  Calls 


possible  to  manually  examine  all  system  calls  and  select  out  the  side-effecting  ones.  For 
example,  a  fprintf  (which  transitively  calls  a  write  system  call)  can  not  be  redun¬ 
dantly  executed.  Based  on  whether  the  results  of  the  system  call  can  be  speculated  and 
whether  the  system  call  needs  to  be  executed  in  both  copies  of  the  program  instances, 
all  system  calls  are  categorized  into  four  categories.  Table  3.1  gives  the  classification  and 
some  examples  from  existing  system  calls.  The  last  column  gives  the  distribution  of  system 
calls  among  these  four  categories,  sampled  from  SPEC  benchmark  programs.  RSFT-Detect 
handles  system  calls  differently  depending  on  which  category  they  belong  to. 


•  Synchronize :  This  category  includes  system  calls  that  have  no  predictable  program 
state  after  system  call  execution.  These  system  calls  cannot  be  executed  redundantly 
(e.g.  sy spread  that  writes  data  to  memory).  Some  read  operations,  such  as  destruc¬ 
tive  read  from  standard  input  or  other  devices,  can  only  be  executed  once.  After  the 
read,  the  content  will  be  destroyed  and  not  be  available  for  a  redundant  read.  RSFT- 
Detect  can  tell  this  kind  of  read  from  the  file  descriptor  argument  of  sys_read  call. 
RSFT-Detect  must  conduct  sys.read  system  call  only  once,  and  copy  the  data  to 
both  program  instances’  private  memory  space  to  ensure  identical  program  state  in 
the  subsequent  execution,  sys  _1  seek  has  the  same  semantics  that  need  to  be  re¬ 
spected. 

•  Asynchronize  duplicate :  Some  system  calls,  such  as  sys_brk,  have  no  predictable 
return  values,  but  its  execution  and  return  values  must  be  duplicated  in  both  pro¬ 
gram  instances  to  ensure  correct  program  execution.  These  system  calls  need  to  have 
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identical  return  values  in  both  program  instances,  but  barrier  synchronization  is  not 
necessary.  RSFT-Detect  handles  this  kind  of  system  calls  by  allowing  the  first  in¬ 
stance  that  invokes  the  call  to  execute  the  call  and  keeps  a  record  of  its  return  values. 
When  the  other  instance  invokes  the  same  call,  RSFT-Detect  compares  their  argu¬ 
ments  for  faults  detection,  then  enforces  the  call  to  do  the  same  work  in  the  second 
instance.  In  the  example  of  sys_t  imes,  the  recorded  return  value  is  returned  to  the 
second  instance. 

•  Asynchronize  single:  This  category  includes  system  calls  with  return  values  that 
can  be  speculated  with  high  confidence,  but  cannot  be  executed  redundantly,  such 
as  sysjwrite.  Among  all  system  calls  invoked  dynamically  on  SPEC  bench¬ 
mark  program,  72.59%  calls  belong  to  this  category.  RSFT-Detect  does  not  require 
barrier  synchronization  for  these  system  calls.  If  one  copy  of  the  process  invokes 
sys  write  ahead  of  the  other  process,  it  can  continue  executing  the  rest  of  the 
code  without  waiting  for  the  system  call  to  complete.  When  the  other  process  reaches 
the  point  of  issuing  sys.write,  it  will  compare  the  arguments  of  this  system  call 
and  the  memory  content  it  is  going  to  write  for  fault  detection.  If  any  fault  occurs, 
RSFT-Detect  is  still  able  to  report  the  fault  to  users. 


3.5  Speculation 

Previous  work  on  transient  fault  detection,  such  as  PLR  proposed  by  Shye  et  al.  [49],  also 
explored  the  method  of  monitoring  program  execution  at  system  call  level  to  find  mis¬ 
matched  arguments.  However,  these  approaches  conservatively  require  both  processes  (the 
original  and  the  redundant)  to  synchronize  at  each  system  call.  Some  system  calls  can  be 
executed  redundantly,  while  some  cannot.  Both  processes  are  allowed  to  continue  execu- 
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tion  only  after  non-redundant  system  call  returns  from  the  kernel,  when  its  return  value  is 
duplicated  and  sent  to  both  processes  for  further  computation.  This  kind  of  barrier  synchro¬ 
nization  adds  considerable  overhead  to  program  execution  time.  Evaluation  in  Chapter  6 
discusses  the  performance  in  details. 

RSFT-Detect,  on  the  other  hand,  does  not  require  barrier  synchronization  between  pro¬ 
cesses  when  the  results  of  the  system  call  can  be  speculated.  For  example,  in  line  C  in 
Figure  2.3,  print  f  invokes  system  call  sys.write,  which  returns  the  number  of  bytes 
transmitted  if  it  succeeds.  RSFT-Detect  always  speculates  the  return  value  to  be  the  num¬ 
ber  of  bytes  supposed  to  be  transmitted  according  to  the  input  parameter.  The  first  program 
instance  that  invokes  this  sys_write  call  is  allowed  to  continue  execution  without  wait¬ 
ing  for  this  system  call  to  complete,  as  shown  in  Figure  2.4.  This  system  call  is  only 
executed  by  the  kernel  when  the  other  program  instance  also  invokes  the  same  call.  This 
high-confidence  value  speculation  allows  one  process  to  proceed  without  waiting  for  the 
other  process  or  the  system  call  to  complete  in  kernel  mode.  A  large  set  of  the  existing 
POSIX  system  calls  such  as  sys.write,  sys_munmap  can  be  speculated  in  this  way. 

3.6  Misspeculation  Detection  and  Recovery 

Misspeculation  occurs  when  the  results  of  a  system  call  differs  from  what  RSFT-Detect 
predicts.  This  means  the  speculatively  executing  program  instance  must  be  discarded  due 
to  a  wrong  prediction  made  earlier.  RSFT-Detect  features  a  fast  misspeculation  detection 
and  recovery  scheme  to  restart  program  execution  from  the  point  of  misspeculation. 

In  order  to  efficiently  record  program  state,  RSFT-Detect  exploits  system  call  forks 
copy-on-write  semantics.  If  the  tracing  process  detects  misspeculation,  it  kills  the  specu¬ 
lative  process  and  duplicate  the  non-speculative  process  for  later  computation.  Copy-on- 
write  semantics  have  previously  been  used  to  implement  speculative  systems  [40].  Com- 
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pared  with  transactional  memory,  this  approach  has  less  speculative  overhead. 

To  inject  a  fork  system  call  into  the  no-speculative  process,  without  rewriting  pro¬ 
gram’s  binary,  RSFT-Detect  follows  the  algorithm  described  in  Algorithm  2.  In  the  fol¬ 
lowing  algorithm,  Wl  refers  to  the  first  invocation  of  the  program  binary,  W2  refers  to  its 
replicate,  and  W  refers  to  either  one  of  the  worker  processes  (Wl  or  W2). 


Algorithm  2  Misspeculation  Detection  and  Recovery 

1:  repeat 

2:  intercepts  system  call  sys  from  a  traced  process  W(either  Wl  or  W2) 

3:  //  Code  for  fault  detection 

4:  if  sys’s  return  value  was  speculated  in  W2  then 

5:  //  symmetric  if  Wl  is  the  speculative  process 

6:  if  Wl. Regs. rax  !=  W2.specvalue[sys]  then 

7:  kill(Wl.pid,  SIGKILL) 

8:  //  duplicate  the  non-speculative  process  W2 

9:  make  a  copy  of  the  register  file  of  W2 

10:  W2.Regs.  rax  =  S  YS-fork 

11:  //  the  instruction  SYSCALL  is  OF  0  5,  which  takes  2  bytes 

12:  W2. Regs. rip  =W2. Regs. rip-2 

13:  continue  executing  the  system  call 

14:  //  program  will  re-enter  interrupt  and  execute  system  call  fork 

15:  wait  for  kernel  finishes  executing  SYS  .fork 

16:  Wl.pid  =  return  value 

17:  copy  back  original  register  file  of  W2 

18:  continue  executing  the  program 

19:  end  if 

20:  end  if 

21:  until  W 1  and  W2  both  exit 


For  the  code  example  in  Figure  2.3,  RSFT-Detect  speculates  the  number  of  bytes  being 
transmitted  by  system  call  sys  write  and  allows  the  process  arriving  first  to  proceed 
without  waiting  for  the  other  process.  When  the  other  process  calls  sys.write,  RSFT- 
Detect  executes  sys.write  and  then  compares  the  return  value  with  the  speculated  value. 
If  the  two  values  are  the  same,  both  processes  proceed  normally.  Otherwise,  misspeculation 
happens.  The  recovery  process  is  demonstrated  in  Figure  3.3.  When  misspeculation  is 
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detected  upon  the  return  of  the  real  system  call  (as  shown  in  Figure  3.3),  RSFT-Detect 
sends  a  signal  to  kill  the  misspeculated  process.  Another  signal  is  sent  to  the  program 
instance  with  correct  values  to  force  a  fork  system  call.  A  new  process  is  then  spawned 
from  Program,  and  continues  the  rest  of  program  execution  redundantly  with  the  correct 
program  state. 


3.7  Virtual  Memory  Space  Synchronization 

To  compare  execution  of  two  processes,  the  original  one  and  its  redundant  copy,  the  fault 
detection  may  have  to  compare  some  virtual  memory  addresses  at  some  system  call.  This 
is  crucial  to  ensure  identical  memory  layout  of  all  processes  to  eliminate  false-positives. 
As  a  result,  some  system  calls,  such  as  sys_mmap,  must  be  executed  in  all  copies  of 
processes  for  a  program  to  continue.  Additionally,  these  two  processes  must  have  identical 
virtual  address  table  to  start  with.  In  systems  with  address  space  randomization  (ASR), 
independence  processes  have  randomized  virtual  address  mapping  even  for  executing  the 
same  program  binary. 

One  solution  to  this  problem  is  to  disable  address  space  randomization  entirely  by  con¬ 
figuring  the  operating  system.  However,  this  also  disables  the  ability  of  other  applications, 
for  example  security  sensitive  applications,  from  benefiting  from  ASR. 

RSFT-Detect  solves  this  problem  by  allowing  the  first  process  that  reaches  sys  mmap 
to  execute  first.  Upon  returning  from  the  system  call,  RSFT-Detect  keeps  a  record  of  the 
virtual  memory  address  that  this  call  allocates  in  this  process.  When  the  other  the  process 
invokes  sys  mmap,  RSFT-Detect  first  compares  their  arguments  for  transient  fault  detec¬ 
tion,  then  forces  this  call  to  map  to  the  same  virtual  address  as  the  first  process,  by  setting 
the  MAP  FORCED  flag.  Because  both  processes  maintain  the  same  memory  layout,  both 
processes  should  be  in  the  same  program  state  before  and  after  the  system  call. 
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App  Tracer  App’ 


Figure  3.3:  Misspeculation  detection  and  recovery  process  in  RSFT-Detect 
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There  are  also  cases  when  sys_mmap  intends  to  map  files  to  a  process’  virtual  memory 
space  and  performs  read  and  write  operations.  By  intercepting  mmap  system  call  and 
checking  its  arguments,  RSFT-Detect  can  identify  such  requests  and  map  that  file  in  RSFT- 
Detect’s  own  memory  space.  RSFT-Detect  then  returns  a  protected  page  address  to  the  two 
program  instances.  When  these  programs  intend  to  access  the  protected  page,  a  signal  is 
sent  to  and  trapped  in  RSFT-Detect.  At  that  point,  RSFT-Detect  checks  for  transient  faults, 
then  performs  the  actual  read  and  write  access  only  once  on  the  memory  mapped  file. 

3.8  Signal  Handling 

In  RSFT-Detect,  signals  sent  to  the  program  and  signals  raised  from  one  copy  of  the  pro¬ 
gram  must  be  handled  in  a  way  that  is  transparent  to  the  user.  Failing  to  handle  these  signals 
will  result  in  either  false  alarms,  or  incorrect  program  behavior.  All  existing  software-only 
transient  fault  tolerance  techniques  either  ignore  these  signals  or  report  them  as  transient 
faults  despite  of  what  they  really  are.  This  section  discusses  two  different  types  of  signals 
and  the  methods  RSFT-Detect  take  to  handle  them. 

Internal  Signal :  Signals  can  be  raised  from  program  itself,  in  the  absence  of  transient 
faults,  either  as  a  program  bug  or  designed  the  program  behavior.  However,  some  transient 
faults  can  cause  a  program  to  raise  an  unexpected  internal  signal.  For  example,  a  particle 
strike  may  flip  a  bit  in  a  register  that  holds  a  memory  address  the  program  loads  from.  Note 
that  the  two  identical  program  instances  do  not  share  memory  space  and  have  different  set 
of  register  files.  Consequently,  a  transient  fault  may  cause  a  segmentation  fault  raised  by 
one  of  the  processes,  but  not  both  of  them.  Similarly,  other  exceptions,  such  as  divide -by¬ 
zero,  may  also  occur  as  a  result  of  a  transient  fault.  To  distinguish  a  transient  fault  from 
a  normal  internal  signal  raised  by  the  program,  such  as  signals  caused  by  program  bugs, 
RSFT-Detect  traps  all  signals  raised  by  both  processes.  If  the  two  processes  both  raise 
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the  same  internal  signal,  it  is  an  expected  signal  and  must  be  handles  in  the  default  way. 
Otherwise,  RSFT-Detect  alerts  the  users  the  existence  of  a  transient  fault. 

External  Signal :  As  a  transparent  transient  fault  detection  technique,  RSFT-Detect 
maintains  the  original  deterministic  behavior  of  the  original  program.  External  signals  may 
cause  non-determinism  among  the  two  copies  of  programs.  For  example,  the  user  may 
press  ctrl-c  from  command  line,  which  sends  a  SIGINT  signal  to  the  program.  RSFT- 
Detect  should  make  sure  the  two  program  copies  behave  as  if  only  one  program  is  running 
and  abort  the  program. 

To  achieve  this,  RSFT-Detect  registers  special  signal  handlers  for  all  external  signals. 
Specialized  signal  handlers  are  registered  at  the  beginning  of  the  program.  When  an  ex¬ 
ternal  signal  is  received,  the  corresponding  signal  handler  is  called  and  proper  actions  are 
taken.  For  example,  in  the  case  of  a  SIGINT  sending  to  the  program  from  command  line, 
RSFT-Detect  communicates  SIGINT  to  both  processes,  terminates  their  execution  and  kills 
itself  as  well. 

There  are  cases  where  one  fault  process  may  bypass  the  signal  raising  code.  RSFT 
features  a  timeout  watchdog  process,  which  will  raise  timeout  signal  if  the  program  has 
not  made  progress  after  a  pre-defined  threshold.  RSFT  will  handle  this  case  as  a  transient 
fault.  This  threshold  is  typically  application  specific.  In  the  evalution  of  this  dissertation, 
the  threshold  was  set  to  ten  times  of  a  typical  run’s  execution  time. 

3.9  Optional  Memory  Page  Walking 

To  protect  program  execution  from  transient  faults  in  memory  sub-systems,  RSFT-Detect 
adopted  process-level  duplication  and  use  redundant  memory  space.  However,  two  pro¬ 
cesses,  one  forked  from  another,  share  the  same  physical  memory  page,  if  no  value  is 
written  to  that  page.  As  a  result,  memory  faults  occurring  in  pages  that  are  shared  by  the 
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two  processes  can  be  read  by  both  program  copies  hence  introducing  faulty  values.  This 
means  the  two  copies  will  produce  the  same  faulty  results,  which  may  cause  the  system  to 
fail  severely. 

RSFT-Detect  solves  this  problem,  by  optionally  turning  on  a  knob  to  perform  automatic 
memory  page  walking.  The  idea  is  to  check  the  two  processes’  physical  page  table  periodi¬ 
cally.  If  any  of  the  pages  that  are  physically  in  memory  has  the  same  physical  page  address 
for  both  processes,  RSFT-Detect  automatically  loads  a  word  from  the  page,  and  immedi¬ 
ately  writes  the  same  value  back,  utilizing  the  copy-on-write  feature  of  forked  processes. 
The  operating  system  will  automatically  create  another  physical  page  as  a  tainted  copy. 

There  is  a  trade-off  between  memory  consumption  and  memory  fault  coverage.  By 
enabling  memory  page  walking,  there  is  a  better  chance  for  RSFT-Detect  to  detect  memory 
transient  faults,  at  the  cost  of  using  more  physical  memory  during  redundant  execution. 
Disabling  this  memory  page  walking  will  reduce  the  size  of  active  physical  memory  usage, 
but  results  in  larger  window  of  vulnerability  in  practice. 

3.10  Transient  Fault  Detection 

A  combination  of  the  above  techniques  gives  RSFT-Detect  the  ability  to  detect  a  wide  range 
of  transient  faults  in  both  processor  and  memory  subsystems.  Figure  3.10  lists  several 
results  of  transient  faults,  and  how  RSFT-Detect  handles  each  of  them. 

•  Register  Value.  A  transient  fault  may  alter  a  register  value  that  never  flows  into  the 
following  computation  of  the  program.  These  faults  can  be  safely  ignorely.  If  the 
altered  register  value  directly  changes  the  values  that  escapes  the  SoR,  these  faults 
can  be  detected  via  value  comparion  in  RSFT  fault  detection.  If  a  fault  changes  an 
address  of  a  memory  access  operation,  the  program  may  trigger  a  segmentation  fault. 
If  a  segamentation  fault  is  found  in  only  one  of  the  work  processes,  this  asymmetric 
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behavior  is  identified  as  a  transient  fault.  Otherwise,  transient  fault  detection  will 
not  catch  this  fault  when  it  occurs.  Instead,  this  fault  will  be  caught  in  later  fault 
detection  via  value  comparison. 

•  Memory  address.  If  a  memory  content  is  modified  by  a  transient  fault,  this  value  can 
be  either  instruction  or  data.  In  case  the  value  is  an  instruction,  and  will  be  executed, 
the  two  worker  processes  have  high  probability  of  executing  different  binary  and 
end  up  sending  different  values  out  of  SoR.  This  kind  of  faults  is  captured  by  RSFT. 
Otherwise,  this  altered  instruction  can  be  safely  ignored.  If  the  memory  content  holds 
a  value  to  be  loaded  to  the  program,  RSFT  detects  such  fault  like  register  values 
discussed  above.  Otherwise,  this  fault  will  not  change  the  program  behavior,  hence 
not  the  target  of  transient  fault  detection. 
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3.11  Window  of  Vulnerability 


A  fault  occurring  in  RSFT-Detect  itself  may  cause  an  unrecoverable  error  or  erroneous 
results  or  undefined  program  behavior.  Additionally,  RSFT-Detect  does  not  protect  the 
operating  system  and  its  services  executed  in  kernel.  Transient  faults  occurring  in  the 
kernel  code  may  still  cause  program  failure. 

Although  RSFT-Detect  itself  represents  a  single  point  of  failure,  it  only  occupies  the 
CPU  for  a  very  short  duration  and  only  consumes  a  few  clock  cycles.  Typically  the  RSFT- 
Detect  process  only  takes  less  than  0.01%  CPU  time  throughout  program  execution.  The 
probability  of  transient  faults  occurring  registers  while  RSFT-Detect  code  is  running  is 
extremely  low. 

The  memory  consumption  of  RSFT-Detect  is  9.5MB  peak.  If  a  transient  fault  hits  the 
main  memory  that  holds  data  used  in  the  tracing  process,  it  may  lead  to  program  crash  or 
wrong  output.  However,  such  faults  may  still  be  benign  faults  or  be  detected  via  transient 
fault  detection.  For  example,  if  the  fault  flips  a  bit  of  a  word  in  memory,  which  is  over¬ 
written  before  the  value  is  loaded,  the  fault  is  a  benign  fault.  If  a  memory  fault  changes  the 
value  of  a  word  in  memory  which  results  in  a  value  mismatch  in  fault  detection  code,  it  can 
also  be  detected. 

Chapter  6  evaluates  RSFT-Detect’s  window  of  vulnerability  by  simulating  transient 
faults  in  both  register  files  and  main  memory. 
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Chapter  4 


RSFT  Transient  Fault  Recovery 


The  previous  chapters  presented  transient  faults  detection  at  program  runtime  using  pro¬ 
gram  instance  duplication  and  runtime  checking.  However,  detecting  a  transient  fault  is 
only  the  first  step  of  fault  tolerance.  Detecting  a  fault  only  prevents  program  from  cor¬ 
rupting  outside  visible  program  state  and  enforces  data  integrity.  Detection  alone  does  not 
guarantee  un-interrupted  correct  program  execution  in  the  presence  of  transient  faults.  In 
an  environment  where  transient  faults  occur  frequently,  users  may  never  have  a  single  run 
without  transient  faults  corrupting  critical  values.  A  system  must  be  able  to  recovery  from 
a  fault  to  be  truly  fault  resilient. 

This  chapter  describes  the  transient  fault  recovery  system  of  RSFT,  called  RSFT- Recover 
.  RSFT-Recover  is  a  novel  program  execution  checkpointing  and  recovery  system  that  per¬ 
forms  cheap  and  fast  program  checkpointing,  program  state  verification  and  fast  program 
state  recovery. 

This  chapter  first  introduces  the  background  of  transient  fault  recovery.  Section  4.2 
discusses  several  common  fault  recovery  mechanisms  proposed  previously  and  their  limi¬ 
tations  to  motivate  the  RSFT-Recover.  Section  4.3  illustrates  the  overall  design  of  RSFT- 
Recover  .  The  following  sections  introduces  the  implementation  details  of  RSFT-Recover  . 
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Section  4.5  discusses  RSFT-Recover’s  window  of  vulnerability  and  compare  it  with  exist¬ 
ing  fault  recovery  proposals. 


4.1  Transient  Fault  Recovery 

After  a  transient  fault  is  detected,  the  program  may  behave  differently  depending  on  the 
result  of  the  transient  faults. 

•  Benign  Faults.  These  transient  faults  alter  values  held  in  registers  or  memory,  but 
does  not  propogate  to  the  output  or  devices.  The  program  can  continue  executing 
and  produce  correct  output  after  a  fault  has  occurred. 

•  Non-benign  faults.  Program  execution  hit  by  a  transient  fault  may  produce  erroneous 
values  to  the  outside  world.  In  such  cases,  the  program  either  cannot  complete  exe¬ 
cution,  or  generates  wrong  output.  When  these  faults  are  detected,  a  transient  fault 
recovery  is  necessary  for  the  program  to  finish  correctly. 

RSFT  operates  as  a  layer  between  applications  and  the  underlying  operating  system. 
As  a  result,  transient  faults  detected  by  RSFT  are  always  non-benign  faults.  To  complete 
program  execution  correctly  and  fully  tolerant  the  transient  faults,  effective  fault  recovery 
is  necessary  for  programs  protected  by  RSFT. 

Among  the  previous  related  work  discussed  in  Chapter  2,  only  a  small  number  of  tran¬ 
sient  fault  tolerance  techniques  cover  transient  fault  recovery  after  a  fault  is  detected.  The 
following  section  discusses  several  representative  transient  fault  recovery  solutions,  their 
advantages  and  disadvantages.  The  transient  fault  recovery  scheme  is  RSFT  is  introduced 
in  later  sections. 

Previous  transient  fault  detection  techniques,  such  as  SRT  [29],  requires  frequent  check¬ 
pointing  to  recover  from  faulty  states  to  safe  states. 
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4.2  Existing  Techniques  and  Limitations 


4.2.1  Program  Re-execution 

The  program  execution  is  stopped  after  detecting  fault,  and  is  restarted  from  the  beginning, 
hoping  to  have  a  fault-free  execution  in  the  second  run.  This  approach  has  several  problems 
itself. 

•  The  program  may  have  already  resulted  in  outside-visible  effects,  such  as  sending 
network  packages.  Re-executing  from  the  very  beginning  of  the  program  results 
in  recomputing  correct  results  computed  previous  to  the  fault,  hence  more  resource 
consumption. 

•  The  probability  of  having  a  transient  fault  in  the  second  run  is  independent  of  the 
previous  run.  Hence  it  is  also  possible  to  have  a  soft  error  in  the  second  run  as  well. 
Continuously  stopping  and  re-executing  a  program,  especially  for  long-run  scientific 
programs  that  may  take  up  to  weeks  or  months,  is  time-consuming  and  not  efficient. 

4.2.2  K-Modular  Redundancy 

Some  existing  techniques  uses  or  triple-modular  redundancy(TMR)  in  either  hardware, 
such  as  Boeing  airplanes  [65]  or  software[42],  PLR  [49]  to  recovery  from  a  transient  fault. 
TMR  means  that  either  three  identical  program/hardware  copies,  or  one  main  code  and 
two  redundant  copies  are  computed  simultaneously.  This  allows  the  fault  to  corrupt  one 
execution  at  a  time,  as  long  as  the  other  two  can  agree  on  the  critical  values.  By  using 
this  scheme,  any  single-even-upset  can  be  corrected  as  two  of  the  versions  still  have  the 
correct  data.  After  a  single-event  fault  is  detected,  the  corrupted  values  are  ignored  and 
corrected.  If  any  independent  memory  image  is  corrupted,  the  particular  program  instance 
is  also  discarded. 
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As  a  natural  extension  of  triple-modular  redundancy,  some  fault  recovery  techniques 
employ  ^-modular  redundancy  and  majority  voting  to  correct  program  execution  from  the 
failure  point  one,  in  the  presence  of  multiple  transient  faults.  These  recovery  techniques 
typically  use  more  than  three  identical  execution  contexts,  and  compare  the  values  that 
escape  the  SoR  to  detect  transient  faults.  If  any  of  the  values  mismatch  with  the  majority 
of  other  copies,  the  value  is  considered  corrupted  and  a  transient  fault  is  detected. 

To  recover  from  the  fault,  the  identical  majority  values  are  considered  correct  and  used 
in  later  computation.  The  corrupted  value  is  thrown  away  and  that  execution  context  is 
also  discarded.  A  system  with  majority- voting  fault  recovery  can  survive  k  faults  is  called 
K-fault  tolerant.  Such  a  system  requires  at  least  k+1  processes  to  tolerance  k  faults.  In 
situations  where  faults  occur  more  than  once  during  program  execution,  the  cost  of  majority 
voting  recovery  can  be  very  high  both  in  term  of  performance  and  power  consumption. 

Some  techniques  can  spawn  another  execution  context  with  correct  values  from  the 
failure  point  on,  to  maintain  k  active  program  instances  at  any  time.  This  solution  may 
be  effective  under  single-event-upset  model.  However,  if  transient  faults  occur  more  fre¬ 
quently  than  expected,  it  is  possible  several  active  copies  contain  different  corrupted  values 
at  the  same  time.  Majority  voting  may  not  get  a  majority  vote.  Moreover,  some  program 
instances  may  present  identical  values  like  other  copies,  but  may  contain  other  corrupted 
values  in  memory.  Duplicating  another  instance  from  these  instances  may  lead  to  future 
program  failure. 

In  addition,  using  more  than  three  processes  for  fault  recovery  consumes  more  com¬ 
putation  resources,  such  as  CPU  time  and  memory,  than  using  two  processes.  Power  con¬ 
sumption  is  also  increased  with  the  increase  of  execution  contexts.  This  is  an  effective  yet 
not  ideal  way  to  recovery  from  transient  faults. 
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4.2.3  Checkpointing 

Other  techniques  [14,  38,  39,  52]  keeps  track  of  correct  program  states  via  checkpointing 
program  execution.  These  checkpoints  can  be  used  later  if  any  transient  fault  is  detected 
and  program  needs  to  roll  back  to  a  previous  correct  version. 

In  practice,  the  program,  or  another  checkpointing  process,  creates  checkpoints  peri¬ 
odically,  either  in  memory  or  some  external  storage  device.  When  a  fault  is  detected,  the 
program  rolls  back  and  resumes  execution  from  the  last  program  checkpoint.  This  approach 
does  not  have  the  problem  of  keep  hitting  a  fault  during  execution.  However,  it  still  suffers 
from  the  problem  that  some  critical  values  may  have  escaped  SoR  between  checkpoint  and 
the  point  a  fault  is  detected.  One  solution  is  to  checkpoint  more  frequently,  ideally  before  or 
after  every  instruction  that  may  cause  outside-visible  behavior.  However,  frequent  check¬ 
pointing  slows  down  program  significantly.  The  evaluation  demonstrated  in  Chapter  6  will 
discuss  this  problem  further. 

To  solve  this  problem,  specialized  hardware  modules  were  proposed  to  perform  check¬ 
pointing  to  maintain  correct  hardware  state.  Additional  10  operations  bookkeeping  are 
necessary  for  fault  recovery  using  checkpointing. 

In  addition,  software  runtime  checkpointing  typically  introduces  significant  perfor¬ 
mance  overhead.  For  example,  some  existing  checkpointing  algorithms  stalls  program 
execution,  make  a  record  of  all  register  files  and  memory  state,  keeps  the  record  in  either 
memory  or  disk  files,  then  resume  program  execution. 

4.3  RSFT-Recover 

This  dissertation  proposes  a  runtime  lightweight  checkpointing  and  fault  recovery  tech¬ 
nique  for  transient  fault  recovery.  This  technique  provides  automatic  program  checkpoint- 
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ing,  outside-visible  behavior  bookkeeping,  and  program  re-execution  from  last  verified 
checkpoint. 

Compared  with  previous  proposed  fault  recovery  techniques  discussed  above,  RSFT- 
Recover  does  not  need  to  re-execute  the  whole  program  from  the  beginning,  hence  reducing 
the  probability  of  transient  fault  occurring  during  program  execution. 

4.3.1  Structural  View 

Like  the  fault  detection  discussed  in  the  previous  chapter  3,  RSFT-Recover  automatically 
duplicates  the  program  binary  execution  into  two  copies.  Current  security  policies  of 
modern  operating  systems  only  allow  a  process’  parent(s)  to  trace  and  intercept  system 
calls.  Therefore,  slightly  different  from  the  implementation  of  fault  detection  only,  RSFT- 
Recover  first  creates  a  process  for  conducting  the  process  (T)  tracing  and  fault  detection. 
Process  T  will  create  a  process  (Wl)  to  invoke  the  program  binary  and  duplicate  W1  in 
another  process  W2  for  redundant  execution.  Figure  4.1  illustrates  the  overall  structure  of 
fault  recovery. 

4.3.2  Runtime  non-invasive  checkpointing 

To  restart  a  program  from  a  correct  previous  version  after  a  fault  is  detected,  the  pro¬ 
gram  needs  to  create  checkpoints  from  time  to  time.  The  runtime  checkpointing  of  RSFT- 
Recover  is  implemented  via  duplicating  a  process’  state,  including  register  files  and  mem¬ 
ory  content.  The  duplication  process  utilize  the  fork  system  call  provided  by  operating 
systems,  fork  system  call  is  designed  to  duplicate  a  process,  with  all  its  memory  pages 
marked  copy-on-write(COW).  The  two  identical  program  instances,  namely  Wl  and  W2, 
each  create  a  checkpoint  using  fork  system  call  independently,  but  at  the  exact  same  pro¬ 
gram  point. 
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Figure  4.1:  RSFT-Recover  Structure  Overview.  Dashed  line  indicates  forking  a  child  pro¬ 
cess  then  transfers  tracing  control  to  other  processes. 
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The  program  point  RSFT-Recover  choose  to  create  a  checkpoint  is  predefined  as  an 
environmental  variable  in  term  of  number  of  system  calls  (NSYS)  in  the  program.  RSFT- 
Recover  checks  the  value  of  the  variable  at  program  runtime  and  create  checkpoints  every 
NSYS  system  calls. 

To  invoke  a  fork  system  call  in  W1  and  W2,  RSFT-Recover  need  to  either  modify  the 
binary  of  W1  and  W2,  or  inject  a  system  call  in  a  non-invasive  way.  In  the  implementation 
proposed  in  this  dissertation,  RSFT-Recover  inject  a  fork  system  call  using  Algorithm  3. 

4.3.3  Background  Process  Image  Verification 

After  a  checkpoint  process  is  created,  it  is  trapped  by  the  tracer  process  (T).  T  communi¬ 
cates  its  PID  to  the  background  scrubbing  process  through  software  communication  queue 
described  in  [19].  The  scrubbing  process  consumes  the  PIDs  and  performs  background 
memory  content  and  register  value  verification.  Algorithm  4  describes  the  process  of  mem¬ 
ory  scrubbing  after  a  checkpoint  is  created. 

4.3.4  Program  State  Bookkeeping 

One  common  problem  of  previous  recovery  techniques  is  that  re-executing  the  program, 
either  from  the  beginning  or  from  a  checkpoint,  involves  re-issuing  some  instructions  that 
has  side-effects.  For  example,  a  program  may  send  signals  to  another  process,  or  send  a 
package  over  the  network  to  another  system,  or  transfer  values  to  hardware  devices,  after 
a  verified  program  checkpoint.  Most  of  these  cases  are  non-reversible.  After  a  program 
recovers  from  a  checkpoint,  if  the  program  executes  these  side-effecting  instructions  again, 
the  program  may  behave  incorrectly  from  the  outside  world’s  view. 

To  solve  this  problem,  RSFT-Recover  keeps  track  of  all  program  behavior  that  may  be 
visible  outside  of  its  SoR.  The  nature  that  RSFT  operates  at  the  system  calls  granularity 
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Algorithm  3  Automatic  Runtime  Checkpointing 
1:  repeat 

2:  intercepts  system  call  sys  from  a  traced  process  W(either  W1  or  W2) 

3:  775V5++ 

4:  if  nsys  ==  NSYS  then 

5:  //  create  a  checkpoint 

6:  if  process  W  is  entering  a  system  call  then 

7:  make  a  copy  of  the  register  file  of  W 

8:  W.Regs.orig_rax  =  SYS  fork 

9:  continue  executing  the  system  call 

10:  wait  for  W  to  return  from  SYS  fork 

1 1 :  NewPID  =  W.  Regs,  rax 

12:  enqueue(NewPID) 

13:  copy  back  original  register  file  of  W 

14:  W.  Regs. rax  =  sys 

15:  //  the  instruction  SYSCALL  is  OF  0  5,  which  takes  2  bytes 

16:  rip  =  rip -2 

17:  continue  executing  W 

18:  //  program  will  re-enter  interrupt  and  execute  the  original  system  call  sys 

19:  else 

20:  make  a  copy  of  the  register  file  of  W 

2 1 :  Regs,  rax  =  SYS  fork 

22:  II  the  instruction  SYSCALL  is  OF  0  5,  which  takes  2  bytes 

23:  rip  =  rip-2 

24:  continue  executing  the  system  call  in  kernel 

25:  //  program  will  re-enter  interrupt  and  execute  system  call  fork 

26:  wait  until  the  system  call  returns  from  SYS  fork 

27:  NewPID  =  W.  Regs,  rax  II  return  value 

28:  c  nq  uc  ucfiVc  wPID) 

29:  copy  back  original  register  file  of  W 

30:  continue  executing  W 

31:  end  if 

32:  else 

33:  perform  fault  detection 

34:  end  if 

35:  until  W 1  and  W2  exit 
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Algorithm  4  Background  Memory  Scrubbing 

1:  repeat 

2:  newPIDl  =  dequeue(); 

3:  newPIDl  =  dequeue(); 

4:  openMemoryMapFilefncwP/Df ,  newPID2 

5:  ope n Me m ory F i le(/? e wPID / ,  newPID2) 

6:  for  all  memory  page  P  in  memory  of  newPIDl  and  newPID2  do 

7:  touch  Memory  Pagcs(F) 

8:  read  page  P  from  memory  of  newPIDl ,  newPID2 

9:  compare  memory  content  from  two  versions 

10:  if  memory  contents  mismatch  then 

11:  discard  this  checkpoint 

12:  else 

13:  keep  this  checkpoint 

14:  discard  the  previous  checkpoint 

15:  end  if 

16:  end  for 

17:  until  signaled  to  exit 


allows  RSFT  -recovery  to  bookkeep  the  program  states  in  term  of  system  calls.  After  a 
program  checkpoint  is  made,  RSFT-Recover  keeps  track  of  the  number  of  system  calls 
issued  from  each  program  instance.  Note  that  RSFT  compares  the  arguments  of  system 
calls  at  each  call  site,  the  difference  of  number  of  system  calls  issues  from  the  two  program 
instances  are  less  or  equal  to  one.  Once  a  transient  fault  occurrence  is  confirmed,  RSFT- 
Recover  tries  to  recover  program  from  the  last  checkpoint.  After  the  program  restarts  from 
the  last  verified  checkpoint,  RSFT-Recover  is  able  to  replay  the  program  execution  without 
re-issuing  instructions  that  cause  outside-visible  behaviors. 


4.4  Runtime  Fault  Recovery 

When  a  transient  fault  is  detected  during  program  execution,  RSFT-Recover  picks  up  a 
previous  correct  checkpoint  and  resume  program  execution  from  the  checkpoint  on. 
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4.4.1  Checkpoint  Process  Resuming 

The  checkpoint  processes  are  duplicates  of  the  original  working  processes  W1  and  W2. 
They  are  stopped  by  the  scrubbing  process  for  memory  content  comparison.  After  the 
process  image  comparison  is  successful,  they  are  still  stopped  at  exactly  the  same  program 
point  and  not  allowed  to  continue  execution. 

Program  execution  recovery  is  two  simple  steps,  as  illustrated  in  Figure  4.2: 

1.  Kill  the  existing  two  working  processes. 

2.  Signal  the  last  verified  checkpoint  processes.  The  checkpoint  processes  are  woken 
up  by  signals  to  continue  executing  the  rest  of  the  program. 


4.4.2  System  Call  Replay 

After  the  program  resumes  from  a  previous  correct  checkpoint  image,  RSFT- Recover  keeps 
track  of  the  number  of  critical  system  calls  such  as  read,  write.  This  number  is  com¬ 
pared  with  the  number  of  correctly  executed  system  calls  after  the  checkpoint.  If  the  num¬ 
ber  is  less  than  the  previously  successfully  committed  system  call  number,  this  system  call 
is  bypassed.  RSFT  does  not  allow  this  system  call  to  flow  into  the  operating  system  or 
causing  any  outside  visible  behavior. 

Note  that  some  system  calls,  such  as  mmap,  mremap,  are  still  performed  to  ensure 
future  correct  program  execution.  Only  those  that  will  send  values  to  outside  world  will  be 
emulated  and  not  replayed. 

However,  there  are  some  cases  that  this  replay  may  violate  program  semantics.  For 
example,  if  an  application  is  supposed  to  send  a  heartbeat  network  message  at  a  certain 
frequency,  skipping  such  system  calls  may  change  how  the  outside  world  sees  this  applica¬ 
tion.  While  this  dissertation  does  not  handle  this  case,  it  is  possible  to  either  notify  users 
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Figure  4.2:  Transient  fault  recovery  timeline  in  RSFT 
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to  ignore  temporary  out-of-sync  nodes  or  modify  the  receiver  of  such  network  messages  to 
be  aware  of  the  temporary  service  outrage. 


4.5  Window  of  Vulnerability 

RSFT-Recover  has  a  small  window  of  vulnerability.  First  of  all,  the  recovery  process  itself 
is  a  single  point  of  failure.  After  a  transient  fault  is  detected,  RSFT-Recover  wakes  up  the 
previous  checkpoint  and  tries  to  restart  program  execution  from  the  last  correct  program 
state.  However,  if  another  transient  fault  occurs  during  the  recovering  process,  the  program 
may  (1)  fail  to  restart,  or  (2)  restart  with  a  faulty  program  state.  In  the  former  case,  the 
timeout  mechanism  will  ask  the  program  to  restart  from  the  very  beginning.  Although  this 
is  a  slower  than  usual  recovery,  it  still  manage  to  recover  program  execution  transparent  to 
the  user.  In  the  latter  case,  the  faulty  program  instances  may  either  result  in  RSFT  detecting 
the  fault  later  via  value  verification,  or  incorrect  program  output. 

Given  the  potential  occurrence  rate  of  transient  faults  in  common  environments,  it  is 
extremely  unlikely  that  the  recovering  process  is  hit  while  performing  the  recovery.  But  in 
severe  environments  with  high  radiation  rate,  this  is  a  window  of  vulnerability  for  RSFT 
protected  programs. 

Another  vulnerable  window  of  RSFT-Recover  is  the  verification  process.  During  mem¬ 
ory  verification,  if  the  memory  content  of  the  checkpoint  processes,  or  the  verification 
result  (match  or  mismatch)  is  corrupted  at  runtime  due  to  a  transient  fault,  RSFT-Recover 
may  incorrectly  conclude  that  one  of  the  working  processes  is  affected  by  a  transient  fault. 
However,  RSFT-Recover  will  not  raise  a  false  positive  alarm  of  transient  fault  in  such  a 
case.  Instead,  RSFT-Recover  will  discard  this  version  of  checkpoint  and  hoping  to  get  a 
correct  checkpoint  next  time. 

RSFT-Recover  also  registers  signal  handlers  for  the  background  process.  If  the  back- 
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ground  verification  process  itself  suffers  from  a  transient  fault,  which  results  in  a  process 
crash,  RSFT-Recover  will  trap  the  signal,  and  create  a  new  process  to  continue  perform 
scrubbing  from  the  failure  point  on.  However,  all  previous  checkpoint  information  will  be 
lost. 

If  any  memory  content  of  the  checkpoint  is  corrupted  before  it  is  used  for  recovery,  it 
may  lead  to  future  program  failure.  However,  because  these  two  checkpoint  processes  are 
hanging  there,  waiting  to  be  waken  up.  The  operating  system  will  schedule  them  off  the 
processors  and  memory  system  most  of  the  time,  the  window  of  being  hit  by  a  transient 
fault  is  only  during  process  image  comparison,  which  is  extremely  low  compared  with  the 
other  processes  or  the  whole  program  execution. 

Note  that  RSFT-Recover  does  not  introduce  false  positives  to  transient  fault  detection, 
since  it  is  independent  from  transient  fault  detection  RSFT-Detect.  RSFT-Detect  can  work 
with  other  recovery  mechanism  as  well  in  a  plug-and-play  fashion. 
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Chapter  5 


Evaluation  Methodology 


This  chapter  describes  the  methodology  used  to  evaluate  the  reliability  of  RSFT  presented 
in  this  dissertation.  Section  5.1  discusses  the  standard  transient  fault  model,  and  a  more  re¬ 
alistic  fault  model  used  for  evaluation  in  this  dissertation.  Section  5.2  explains  the  metrics 
used  to  measure  reliability  of  RSFT.  Section  5.3  details  the  simulation  details  and  evalua¬ 
tion  method  used  to  measure  the  metrics. 

Section  5.4  describes  the  evaluation  environment  in  this  dissertation.  And  Section  5.5 
presents  an  analysis  of  the  benchmark  characteristics.  Section  5.6  inroduces  the  methods 
used  in  evaluation  on  a  set  of  23  for  reliability,  performance,  and  resource  consumption. 

5.1  Transient  Fault  Model 

In  recent  fault  tolerance  research,  Single  Event  Upset(SEU)  model  is  commonly  adopted  to 
evaluate  the  reliability  of  techniques.  This  model  assumes  that  one  and  only  one  transient 
fault  occurs  during  one  program  execution,  and  this  fault  flips  exactly  one  bit.  The  SEU 
itself  is  not  considered  permanently  damaging  to  the  transistor’s  functionality. 

Although  SEU  model  is  commonly  adopted,  it  is  also  beneficial  to  study  the  reliability 
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of  fault  tolerance  under  the  Multiple  Event  Upset  (MEU)  model.  The  MEU  model  means 
that  there  can  be  more  than  fault  occurring  per  program  execution.  These  faults  may  flip 
exactly  one  bit,  or  flip  multiple  bits  at  a  time.  In  reality,  radiation  may  flip  a  bit  and  one  or 
more  bits  that  are  physically  close  to  the  flipped  bit.  Compared  with  the  SEU  model,  MEU 
is  more  realistic  and  is  useful  to  study  the  fault  tolerance  techniques  when  fault  frequency 
is  higher  than  expected.  ECC  memory  can  protect  memory  content  against  transient  faults, 
but  limited  to  SEU  only.  In  cases  where  multiple  bit  are  flipped  within  an  ECC  protection 
unit  (e.g.  a  word),  ECC  is  more  likely  to  fail. 

To  compare  with  existing  technique,  we  did  experiments  with  MEU  model.  For  each 
fault  model,  the  simulation  was  done  3,000  times  on  each  benchmark  program.  The 
evaluation  is  implemented  using  a  processing  tracing  tool  (ptrace  on  Linux)  to  simu¬ 
late  transient  faults  at  runtime  in  both  register  files  and  memory  space,  in  a  way  similar 
to  [20,  50,  21]).  Two  bits  of  the  same  word  are  flipped  in  the  program’s  register  files  or 
physical  memory  space. 

5.2  Reliability  Metrics 

Transient  faults  may  have  different  effects  on  program  execution.  Mukherjee  et  al.  classify 
transient  faults  by  these  effects  into  the  following  categories  [31]: 

•  Benign  Fault.  A  transient  fault  that  does  not  affect  the  outcome  of  the  program  is 
a  benign  fault.  Wang  et  al.  report  that  85%  of  total  transient  faults  are  benign,  and 
result  in  no  externally  visible  program  errors  [61]. 

•  Silent  Data  Corruption.  If  a  fault  induces  the  system  to  generate  erroneous  program 
outputs,  it  is  said  to  have  suffered  silent  data  corruption.  Silent  data  corruption,  or 
SDC,  is  what  all  fault  detection  techniques  are  designed  to  detect  and  what  fault 


52 


recovery  mechanisms  are  designed  to  recover  from. 


•  Detected  Unrecoverable  Error.  A  detection-only  protection  mechanism  does  not 
recover  the  program  from  a  fault,  but  prevents  SDC  through  fail-stop  mechanism, 
thereby  avoiding  any  data  corruption.  An  error  of  this  kind  is  called  Detected  Unre¬ 
coverable  Error,  or  DUE. 

The  target  of  this  dissertation  is  to  detect  and  recover  from  the  faults  that  cause  SDCs. 
RSFT  benefits  from  the  insight  that  minimizing  the  cost  of  detecting  benign  faults  can 
improve  program  performance  by  a  large  amount. 

The  average  Mean-Time-To-Failure  (MTTF)  time  is  a  widely  accepted  metrics  to  eval¬ 
uate  the  reliability  of  a  fault  tolerance  technique.  At  the  time  of  a  fault  injection,  a  times¬ 
tamp  is  recorded.  If  the  fault  does  not  lead  to  any  failure  or  detected  fault,  this  timestamp 
is  discarded.  Otherwise,  another  timestamp  is  recorded  at  the  time  of  fault  detection  or 
program  failure.  The  difference  between  these  two  timestamps  is  called  Mean-Time-To- 
Failure  (MTTF).  The  distribution  of  MTTF  out  of  3000  runs  of  each  benchmark  program 
is  demonstrated  in  the  evaluation  section. 

Mean  Time  to  Failure  does  not  describe  the  reliability  of  a  fault  tolerance  technique 
entirely.  If  one  technique  enables  the  program  to  execute  twice  as  faster  as  the  program 
protected  by  another  technique,  the  second  program,  since  taking  longer  to  run,  is  twice 
more  likely  to  be  hit  by  transient  faults.  Some  previous  work  used  other  metrics,  such 
as  Mean  Instruction  To  Failure  (MITF)  and  Mean  Work  To  Failure  (MWTF)  [44].  These 
methods  are  useful  when  all  instruction  are  equally  important  and  takes  roughly  the  same 
time  to  execute,  or  when  the  definition  of  work  is  the  same  across  all  techniques  in  com¬ 
parison. 

This  dissertation  adopts  the  metrics  of  MTTF.  Typically,  the  faster  a  program  executes, 
the  less  likely  it  execution  is  affected  by  transient  faults,  which  is  correlated  with  MTTF. 
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Reason 

Symptom 

Modelling 

Instruction 

instruction  corruption 

Modify  the  value  in  a  random  memory  address  that  maps  to  the  program  binary 

Physical  Memory 

instruction  corruption  or  data  corruption 

Modify  random  memory  address  that  may  contain  instruction  or  data 

Memory  Bus 

instruction  fetch  error  or  data  corruption 

Modify  random  memory  address  that  may  contain  instruction  or  data 

Register  File 

data  corruption 

Flip  bits  in  a  random  general-purpose,  XMM,  floating  point,  or  register 

Program  Counter 

incorrect  program  flow 

Flip  bits  in  the  PC  register 

Control  Logic 

PC  value  corruption 

Bit  flip  in  PC  register 

Re-order  buffer 

data  corruption 

Bit  flip  in  a  random  register  or  memory  content 

load  store  queue 

data  corruption 

Bit  flip  in  a  random  register  or  memory  content 

Table  5.1:  Transient  fault  types  and  modeling 

Given  that  the  goal  is  to  protect  program  execution  from  transient  faults,  it  is  reasonable  to 
simply  measure  MTTF  as  the  simulation  is  injection  a  constant  number  of  faults  per  run, 
not  per  time  unit. 

5.3  Transient  Fault  Simulation 

This  dissertation  considers  transient  faults  in  register  files,  physical  memory,  and  control- 
logic.  Table  5.1  shows  various  types  of  transient  faults  and  the  way  they  are  modelled  in 
this  dissertation. 

All  reliability  evaluation  demonstrated  in  this  dissertation  injects  faults  into  all  three 
processes^ two  worker  and  one  tracer).  This  fault  injection  method  does  not  inject  faults 
into  the  operating  system  itself.  However,  the  time  spent  in  the  tracer  process  includes  the 
time  that  all  system  calls  spent  in  the  operating  system.  By  injecting  faults  into  the  tracer 
process  and  change  its  register  or  memory  values,  some  faults  in  the  operating  system  are 
also  simulated. 

5.3.1  Register  Fault  Simulation 

Many  prior  work  uses  PIN  dynamic  instrumentation  to  inject  faults  into  register  files.  This 
thesis  uses  another  system-level  approach.  First,  a  profile  run  of  the  original  program 
binary  is  timed  to  estimate  how  long  it  may  take  to  execute  the  program.  Before  fault 
injection,  our  tool  randomly  selects  one  point  in  time,  one  random  program  instance,  one 
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random  bit  of  a  register,  as  well  as  one  random  register  among  general-purpose,  floating 
point,  XMM,  and  flag  registers.  During  program  runtime,  this  fault  injection  simulation 
issues  an  alarm  after  a  random  period  of  time.  It  then  sends  a  signal  to  the  randomly 
selected  process,  stops  its  execution,  and  flips  the  random  bit  of  the  selected  register.  The 
particular  program  instance  then  continues  execution.  Finally,  the  execution  result  of  the 
fault-injected  program  is  compared  against  the  reference  output  to  ensure  that  the  RSFT- 
protected  program’s  externally  visible  behavior  is  correct. 

The  drawback  of  this  kind  of  simulation  is  that  it  only  simulates  architectural  states 
including  register  files,  instruction  decoding,  and  program  counter.  Some  work  also  sim¬ 
ulates  transient  faults  at  a  microarchitectural  level  [43],  and  an  RTL  level  [61].  Because 
transient  faults  naturally  occurs  at  the  hardware  logic  level,  simulation  at  a  lower  level, 
such  as  RTL  level,  has  better  accuracy.  For  example,  the  hardware  simulation  can  inject 
faults  into  the  bypass  network,  which  is  not  visible  to  the  architectural  view.  However, 
this  approach  is  usually  prohibitively  slow  for  sizable  programs  such  as  SPEC  benchmark 
suites.  Mukherjee  et  al.  proposed  methodologies  to  estimate  the  architectural  vulnerability 
factors(AVF)  at  microarchitecture  level  by  sampling  the  AVF  of  a  given  structure  in  a  given 
time  [31].  These  techniques  can  only  simulate  and  cover  a  fraction  of  the  program. 

The  reliability  evaluation  in  this  dissertation  is  pessimistic.  For  example,  some  benign 
faults,  such  as  the  faults  that  hit  the  processor  but  are  automatically  correctly  before  the 
value  is  loaded,  or  is  microarchitecturally  masked,  are  not  simulated  and  represented  in 
this  dissertation.  However,  since  these  are  benign  faults  and  does  not  change  program 
behavior,  RSFT  does  not  attempt  to  detect  or  recovery  from  them. 
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5.3.2  Memory  Fault  Simulation 

Previous  work  only  simulates  transient  faults  in  registers.  The  instruction- level  redundancy 
and  redundant  multi-threading  approaches  only  maintain  one  memory  state,  and  rely  on 
ECC  memory  to  protect  programs  against  memory  transient  faults.  Process-based  redun¬ 
dancy  techniques  maintain  multiple  memory  states  and  can  provide  transient  fault  protec¬ 
tion  for  memory.  However,  prior  work  using  process-based  redundancy,  including  PLR, 
did  not  do  any  experiments  regarding  memory  transient  faults.  This  thesis  is  the  first  to 
simulate  and  evaluate  the  detection  of  transient  faults  in  memory  for  software  transient 
fault  tolerance  techniques. 

Similar  to  injecting  faults  into  register  files,  memory  fault  injection  involves  a  profile 
run,  and  a  random  selection  of  a  program  point.  Subsequently,  memory  fault  injection 
randomly  selects  a  virtual  memory  address  owned  by  one  process,  and  randomly  flips  a 
bit  of  the  value  stored  in  that  memory  address.  This  memory  address  may  contain  data  or 
text  of  the  program.  Memory  faults  are  injected  into  all  running  processes  (the  monitoring 
process,  and  two  program  instances),  with  the  likelihood  of  the  occurrence  of  the  faults 
being  proportional  to  the  amount  of  memory  used  by  each  process.  Each  program  was 
executed  3000  times  with  one  memory  fault  injected  each  time. 

Memory  faults  are  only  injected  into  the  physical  memory.  Although  a  process  may  use 
a  large  amount  of  virtual  memory,  the  number  of  physical  memory  used  is  typically  much 
smaller  than  the  virtual  memory  pages.  One  reason  is  that  virtual  memory  space  includes 
not  only  stacks  and  heap,  but  also  binary  file  and  dynamically  loaded  libraries.  Not  all  the 
pages  are  accesses  during  program  execution.  Another  reason  is  that  only  recently  used 
pages  are  swapped  into  the  main  memory  for  locality. 

Because  we  are  simulating  real  transient  faults  which  occur  in  hardware,  memory  con¬ 
tent  that  is  not  physically  in  memory  is  not  supposed  to  be  changed  by  any  transient  fault. 
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As  a  result,  like  the  register  fault  simulation,  this  dissertation  also  picks  random  point 
in  program  execution,  and  halts  the  process  execution  at  that  point.  After  the  process  is 
paused,  the  injection  process  walks  through  the  physical  page  table  of  the  randomly  se¬ 
lected  process,  make  a  record  of  pages  that  are  currently  in  memory,  and  randomly  pick  a 
bit  in  a  random  memory  address  that  belongs  to  those  pages. 

To  demonstrate  the  advantage  of  memory  fault  tolerance  using  RSFT,  the  evaluation  in 
this  dissertation  flips  two  bits  of  the  same  word  stored  in  that  random  memory  address.  The 
reason  is  that  ECC  memory  provides  fault  tolerance  against  transient  faults.  Guaranteed  to 
detect  and  recover  from  single-bit  flip  faults,  ECC  memory  may  fail  when  multiple  bits  of 
the  same  protection  unit  are  flipped  at  the  same  time.  However,  when  a  particle  strikes  and 
flips  a  bit,  it  is  possible  that  the  same  particle  beam  may  flip  other  bits  that  are  physically 
near  the  already  altered  bit,  which  makes  ECC  memory  vulnerable.  RSFT  compares  the 
values  instead  of  setting  parity  values  of  bits,  therefore  can  detect  multiple-bit-flip  events. 

Memory  faults  injected  into  RSFT  itself  may  also  be  detected.  For  example,  if  a  fault 
changes  the  value  in  memory  that  stores  the  system  calls’  arguments  for  later  comparison, 
this  fault  will  be  detected  when  the  values  are  compared  against  values  from  the  other 
process  later.  Another  example  is  that  the  memory  fault  injected  may  be  transitively  passed 
on  to  one  of  the  processes  and  results  in  a  value  mismatch  in  later  transient  fault  detection. 
However,  if  an  injected  memory  fault  changed  a  value  in  memory  that  is  transitively  passed 
on  to  both  processes,  and  that  value  affects  the  final  output,  RSFT  will  not  be  able  to  detect 
it. 

The  disadvantage  of  this  memory  faults  injection  method  is  that  the  simulation  is  not 
at  hardware  level.  Previous  work  on  simulation  memory  faults,  such  as  the  ECC  tech¬ 
nique,  evaluates  its  reliability  by  simulating  transient  faults  at  the  hardware  level.  Hardware 
level  simulation  is  more  accurate  than  software  level  simulation.  However,  the  simulation 
methodology  used  in  this  dissertation  can  approximate  hardware  simulation  as  close  as  soft- 
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Processor 

Intel  Core  2  Duo®Q6600 

Processor  Speed 

2.4GHz 

L2  Cache  size 

4096KB 

RAM 

8GB 

Operating  System 

Linux  2.6.38 

Compiler 

gcc  4.6 

Table  5.2:  Platform  details 


ware  simulation  can  get.  Because  RSFT  is  an  application-level  fault  tolerance  technique, 
memory  faults  in  the  memory  space  used  by  the  unprotected  programs  are  not  simulated  or 
evaluated. 

This  chapter  presents  an  thorough  evaluation  on  a  set  of  23  for  reliability,  performance, 
and  resource  consumption.  The  methodology  used  to  evaluate  RSFT  is  described  in  this 
chapter. 


5.4  Experiment  Setup 

The  program  binaries  for  all  benchmark  programs  are  generated  using  compilation  option 
-02  with  gcc.  The  performance  of  the  original  binaries  are  used  as  the  baseline  throughout 
the  evaluation.  Both  performance  and  reliability  evaluation  were  conducted  by  executing 
the  binaries  with  RSFT  on  a  2.4Ghz  Intel  Core  2  Quad-Core  (Q6600)  machine  running 
Linux  2.6.38. 


5.5  Benchmark  Programs 

The  benchmark  program  used  for  evaluation  in  this  dissertation  are  from  SPEC  CPU  2000 
and  SPEC  CPU  2006  benchmark  suites.  These  benchmarks  were  selected  to  be  included 
in  SPEC  benchmark  suites  as  representatives  of  real-world  applications,  including  general- 
purpose  applications  and  scientific  applications.  For  example,  433. mile  is  developed  by 
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1000000 


Figure  5.1:  Number  of  system  calls  in  each  benchmark  (log  scale) 


the  MIMD  Lattice  Computation  (MILC)  collaboration  for  doing  simulations  of  four  di¬ 
mensional  SU(3)  lattice  gauge  theory  on  MIMD  parallel  machines.  This  code  is  used  for 
millions  of  node  hours  at  DOE  and  NSF  supercomputer  centers.  The  programs  selected  for 
evaluation  in  this  dissertation  were  chosen  based  on  availability. 

Figure  5.1  shows  the  number  of  system  calls  issued  from  each  unprotected  original 
benchmark  program.  Due  to  the  huge  variation  of  number  of  system  calls  (from  several 
hundreds  to  a  couple  of  hundred  of  thousands),  the  number  of  system  calls  are  shown  in  the 
log  scale  of  10.  Among  all  298  system  calls  defined,  about  45  of  them  are  actually  issued 
in  the  SPEC  benchmark  suite. 
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Figure  5.2:  Normalized  System  Calls  Categories  in  each  benchmark 

5.6  Performance  Evaluation  Methods 

5.6.1  Runtime  Overhead 

The  runtime  overhead  is  one  of  the  most  important  metrics  used  to  evaluate  RSFT.  The 
runtime  performance  of  RSFT  is  evaluated  in  comparison  with  the  non-speculative  exe¬ 
cution.  The  runtime  overhead  is  obtained  from  measuring  the  average  execution  time  of 
RSFT  from  five  runs  on  an  empty  unoccupied  multi-core  system,  normalized  to  the  original 
unprotected  program  execution  time. 

This  evaluation  was  conducted  using  time  on  Linux.  The  execution  time  includes 
RSFT  initialization,  new  process  forking,  checkpointing  (if  recovery  is  enabled),  program 
computation,  and  the  time  spent  in  the  operating  system. 
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5.6.2  Memory  consumption 

Although  duplicating  a  program  execution  doubles  the  amount  of  virtual  memory  used  by 
the  original  program  instance,  it  does  not  necessarily  duplicate  program’s  physical  memory 
consumption.  Compared  with  virtual  memory  size,  extra  physical  memory  usage  is  a  more 
accurate  metric  in  evaluating  how  much  memory  resources  are  consumed  that  can  be  used 
by  other  processes. 

To  compute  how  much  physical  memory  is  used  at  runtime  by  RSFT  and  the  duplicated 
program  instance,  this  dissertation  instruments  in  program  execution  to  evaluate  real-time 
physical  memory  usage.  This  is  achieved  through  stalling  process  execution  periodically, 
then  walking  through  the  page  table  of  stalled  process.  If  a  virtual  memory  page  in  page 
table  has  a  corresponding  physical  page  address,  it  is  considered  a  page  in  memory.  In 
addition,  the  tracing  process  itself  is  also  periodically  stalled  and  checked  for  physical 
memory  consumption  to  get  a  total  number  for  all  processes. 

5.6.3  Power  consumption 

Similar  to  memory  consumption,  duplicating  program  execution  does  not  necessarily  in¬ 
troduce  more  power  consumption.  The  evaluation  in  this  dissertation  uses  a  power  monitor 
attached  to  the  evaluation  platform.  Its  voltage  is  observed  during  program  runtime  to  com¬ 
pute  the  power  consumed  by  the  whole  evaluation  platform.  Since  the  evaluation  is  carried 
out  on  an  unoccupied  system,  the  power  consumption  is  the  power  used  by  the  hardware 
resources  that  execute  the  operating  system,  RSFT,  and  RSFT  protected  program. 

The  power  consumption  was  measured  in  a  separate  process  during  the  execution  of 
RSFT  protected  benchmark  programs.  The  measurement  was  done  via  sampling  power 
voltage  periodically,  according  to  a  pre-defined  sampling  rate.  In  this  dissertation,  the 
sampling  rate  is  set  to  once  per  10  second.  The  average  voltage  is  computed  as  the  average 
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of  all  samples.  The  power  consumption  is  proportional  to  the  time  spent  in  executing  the 
program,  multiplied  by  the  average  consumption  per  sample. 

This  dissertation  report  total  power  consumption  of  the  entire  machine,  including  CPU, 
memory,  disks  etc.  Full  system  power  was  measured  at  the  maximum  sampling  rate  (13 
samples  per  minute)  supported  by  the  power  distribution  unit  (AP7892  [55]).  Newer  chips 
have  power  monitoring  units  with  higher  sampling  rates  and  clock  gating  per  core.  They 
could  be  used  to  obtain  more  accurate  power  consumption  in  the  future. 

5.6.4  Transient  Fault  Recovery 

The  transient  fault  recovery  scheme  is  evaluated  using  the  following  metrics  :  (1)  Whether 
the  faulty  program  execution  can  be  recovered  or  not;  (2)  How  much  time  is  used  to  recover 
from  a  detected  failure. 

When  a  transient  fault  fail  is  detected,  RSFT  recovers  the  program  execution  using  a 
previous  verified  checkpoint.  Unlike  previous  solutions  that  must  restart  program  from 
the  very  beginning,  RSFT  maintains  the  last  verified  checkpoint,  and  verifies  again  right 
before  recovering  from  that  checkpoint.  However,  due  to  the  time  spent  on  checkpointing 
and  program  state  recovery,  it  is  not  free  to  recover  from  the  fault.  The  mean  time  to 
recovery  (MTTR)  is  also  evaluated  and  reported  in  Chapter  6. 
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Chapter  6 


Evaluation 

6.1  Reliability 

The  reliability  of  fault  tolerant  techniques  can  be  evaluated  by  their  windows  of  vulnerabil¬ 
ity  and  mean  time  to  fault  detection.  This  section  focuses  on  these  two  metric:  window  of 
vulnerability  and  mean  time  to  detection.  Both  were  measured  and  analyzed  with  register 
fault  simulation  and  memory  fault  simulation. 

6.1.1  Window  of  Vulnerability 

To  study  the  window  of  vulnerability  of  RSFT,  a  process  tracing  tool  (ptrace)  was  used 
to  simulate  single-event-upset  (SEU)  [41,  45,  59]  transient  faults  at  runtime  in  both  register 
files  and  the  memory  (in  a  way  similar  to  [20,  50,  21]).  Figure  6.1  shows  the  fault  distribu¬ 
tion  of  injected  faults.  The  x-axis  shows  the  faults  injected  into  register  files  (R)  or  memory 
(M)  for  each  benchmark  program.  The  y-axis  is  the  percentage  of  faults  in  each  category. 
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■ 

SDC 

■ 

Timeout 

■ 

Detected 

□ 

Benign 

Figure  6.1:  Register  (R)  and  Memory  (M)  Transient  Fault  Distribution 


Register  Fault  Tolerance 

The  R  columns  in  Figure  6.1  shows  the  experimental  results  from  injecting  faults  into  the 
register  file  at  program  runtime.  First,  a  profile  run  of  the  original  program  binary  is  timed 
to  estimate  how  long  it  may  take  to  execute  the  program.  Before  fault  injection,  our  tool 
randomly  selects  one  point  in  time,  one  random  program  instance,  one  random  bit  of  a 
register,  as  well  as  one  random  register  among  general-purpose,  floating  point,  XMM,  and 
flag  registers.  During  program  runtime,  this  fault  injection  simulation  issues  an  alarm  after 
a  random  period  of  time.  It  then  sends  a  signal  to  the  randomly  selected  process,  stops 
its  execution,  and  flips  the  random  bit  of  the  selected  register.  The  particular  program 
instance  then  continues  execution.  In  our  evaluation,  we  only  inject  faults  into  the  two 
redundant  program  instances,  but  not  into  the  monitoring  process.  Since  the  monitoring 
process  spends  very  little  CPU  time  (CPU  utilization  is  lower  than  0.01%),  it  is  extremely 
unlikely  that  a  transient  fault  occurs  while  the  monitoring  process  is  executing.  It  must  also 
be  noted  that  this  fault  injection  method  does  not  inject  faults  into  the  operating  system. 
Finally,  the  execution  result  of  the  fault-injected  program  is  compared  against  the  reference 
output  to  ensure  that  the  RSFT-protected  program’s  externally  visible  behavior  is  correct. 
Figure  6.1  shows  the  aggregated  fault  distribution  over  3000  runs. 
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Benchmark 

Name 

Occurrence  for  Register  Injected  Faults  (%) 

Occurrence  for  Memory  Injected  Faults  (%)  1 

Benign  Faults 

Detected  Faults 

Timeout 

SDC 

Benign  Faults 

Detected  Faults 

Timeout 

SDC 

008. espresso 

48.74 

51.23 

0.03 

0.0 

91.32 

7.47 

1.21 

0.0 

129. compress 

66.69 

33.31 

0.00 

0.0 

82.87 

17.10 

0.03 

0.0 

130.U 

66.13 

33.83 

0.03 

0.0 

85.15 

14.85 

0.0 

0.0 

134.perl 

41.50 

58.46 

0.03 

0.0 

88.90 

11.06 

0.04 

0.0 

164.gzip 

60.70 

39.23 

0.07 

0.0 

72.27 

27.73 

0.0 

0.0 

171. swim 

11.92 

87.97 

0.11 

0.0 

76.77 

23.23 

0.0 

0.0 

172.mgrid 

19.83 

80.10 

0.07 

0.0 

86.13 

13.87 

0.0 

0.0 

175.vpr 

29.36 

70.60 

0.04 

0.0 

85.27 

14.70 

0.03 

0.0 

179.art 

83.66 

16.25 

0.09 

0.0 

94.35 

5.61 

0.04 

0.0 

181.mcf 

60.70 

39.23 

0.07 

0.0 

72.27 

27.73 

0.0 

0.0 

183.equake 

64.94 

34.98 

0.08 

0.0 

88.90 

11.07 

0.03 

0.0 

197. parser 

45.76 

54.17 

0.07 

0.0 

83.37 

16.60 

0.03 

0.0 

256.bzip2 

26.92 

72.98 

0.10 

0.0 

94.78 

5.22 

0.0 

0.0 

401.bzip2 

65.89 

34.11 

0.0 

0.0 

87.10 

12.90 

0.0 

0.0 

429.mcf 

53.23 

46.74 

0.03 

0.0 

55.33 

44.67 

0.0 

0.0 

433.milc 

13.67 

86.25 

0.08 

0.0 

97.81 

2.19 

0.0 

0.0 

435.gromacs 

37.30 

62.57 

0.13 

0.0 

95.87 

4.10 

0.03 

0.0 

445.gobmk 

70.18 

29.79 

0.07 

0.0 

89.44 

10.56 

0.0 

0.0 

456.hmmer 

62.80 

37.13 

0.10 

0.0 

84.90 

15.10 

0.0 

0.0 

458.sjeng 

63.57 

36.33 

0.0 

0.0 

96.13 

3.83 

0.03 

0.0 

470. lbm 

49.42 

50.55 

0.03 

0.0 

86.54 

13.41 

0.05 

0.0 

482.sphinx3 

32.20 

67.73 

0.07 

0.0 

94.07 

5.88 

0.05 

0.0 

Table  6.1:  Distribution  of  the  outcomes  of  injected  faults.  The  numbers  correspond  to  3000 
runs  of  each  benchmark. 

The  values  for  register  fault  injection  in  Figure  6.1  show  that  a  large  percentage  of 
transient  faults  injected  are  benign  faults,  which  do  not  result  in  either  program  crash  or 
wrong  output.  Among  the  non-benign  faults,  all  faults  injected  are  detected  by  RSFT  via 
either  value  comparison,  customized  signal  handling,  or  timeout  watchdog.  Timeout  occurs 
very  rarely  (0.1%  on  average).  It  is  not  visible  in  Figure  6.1  but  is  shown  in  Table  6.1. 


Memory  Fault  Simulation 

The  instruction-level  redundancy  and  redundant  multi-threading  approaches  only  maintain 
one  memory  state,  and  rely  on  ECC  memory  to  protect  programs  against  memory  transient 
faults.  Process-based  redundancy  techniques  maintain  multiple  memory  states  and  can 
provide  transient  fault  protection  for  memory.  However,  prior  work  using  process-based 
redundancy,  including  PLR,  does  not  duplicate  physical  memory  pages  unless  one  or  more 
values  are  written  to  that  page.  This  reduces  their  fault  coverage  against  transient  faults 
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in  memory.  This  dissertation  is  simulates  and  evaluates  the  detection  of  transient  faults  in 
memory. 

The  M  columns  in  Figure  6.1  and  Table  6.1  demonstrate  the  experimental  results  from 
injecting  faults  into  memory  subsystems.  Similar  to  injecting  faults  into  register  files,  mem¬ 
ory  fault  injection  involves  a  profile  run,  and  a  random  selection  of  a  program  point.  Sub¬ 
sequently,  memory  fault  injection  randomly  selects  a  virtual  memory  address  owned  by 
one  random  process,  and  randomly  flips  a  bit  of  the  value  stored  in  that  memory  address, 
only  if  the  page  is  physically  in  memory.  This  memory  address  may  contain  data  or  text 
of  the  program.  Memory  faults  are  injected  into  all  running  processes  (the  monitoring  pro¬ 
cess,  and  two  program  instances),  with  the  likelihood  of  the  occurrence  of  the  faults  being 
proportional  to  the  amount  of  memory  used  by  each  process.  Each  program  was  executed 
3000  times  with  one  memory  fault  injected  each  time. 

Memory  faults  injected  into  RSFT  itself  may  also  be  detected.  For  example,  if  a  fault 
changes  the  value  in  memory  that  stores  the  system  calls’  arguments  for  later  comparison, 
this  fault  will  be  detected  when  the  values  are  compared  against  values  from  the  other 
process  later.  Another  example  is  that  the  memory  fault  injected  may  be  transitively  passed 
on  to  one  of  the  processes  and  results  in  a  value  mismatch  in  later  transient  fault  detection. 
However,  if  an  injected  memory  fault  changed  a  value  in  memory  that  is  transitively  passed 
on  to  both  processes,  and  that  value  affects  the  final  output,  RSFT  will  not  be  able  to  detect 
it. 

As  shown  in  the  Figure  6.1,  memory  faults  usually  do  not  result  in  erroneous  program 
output.  As  observed  in  experiments,  a  large  percentage  of  memory  faults  are  benign  faults. 
Only  an  average  of  12.83%  memory  faults  are  non-benign  faults  and  are  detected  by  RSFT. 
For  179. art,  the  original  program  only  consumes  10MB  memory  at  runtime,  therefore  the 
probability  of  memory  faults  occurring  in  the  monitoring  process  is  as  large  as  50%.  Ex¬ 
periment  results  shows  that  among  129  detected  faults  and  2  timeout  out  of  3000  runs,  16 
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Figure  6.2:  Mean  time  to  detection  for  RSFT-NoSpec  and  RSFTwith  and  without  fault 
recovery. 


detected  faults  and  1  timeout  are  caused  by  memory  faults  in  the  monitoring  process. 


6.1.2  Mean  Time  to  Detection 

Because  RSFT  is  a  process-level  redundant  execution  technique,  low  runtime  overhead  is 
achieved  via  infrequent  value  verification  and  system  call  interception.  However,  the  low 
overhead  also  comes  at  a  cost  of  delaying  transient  fault  detection  until  it  propagates  to 
some  value  that  escapes  RSFT’s  SoR.  To  understand  the  delay  of  transient  fault  detection 
in  RSFT,  the  mean  time  to  detecting  a  fault  is  evaluated  on  all  benchmark  programs,  as 
shown  in  Figure  6.2. 

Figure  6.2  demonstrates  that  for  most  transient  fault  injected,  RSFT  is  able  to  detect  the 
faults  within  10  seconds.  008. espresso,  134.perl  and  164.gzip  takes  less  than  10  seconds  to 
complete,  hence  the  mean  time  to  detection  is  always  less  than  10  seconds.  For  programs 
that  takes  a  long  time  to  complete,  such  as  SPEC2006  benchmark  prorgrams,  majority  of 
transient  faults,  if  not  benign,  are  detected  and  reported  within  300  seconds,  or  5  minutes. 
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6.2  Performance 


This  section  discusses  the  performance  overhead  of  RSFT  ,  with  and  without  recovery.  The 
overhead  includes  runtime  overhead,  resource  consumption  overhead,  such  as  memory  and 
power. 

6.2.1  Runtime  Overhead 

Figure  6.3  shows  the  runtime  overhead  (vertical  axis  in  the  figure)  of  RSFT,  normalized 
with  respect  to  the  original  program  execution  without  any  transient  fault  detection.  The 
result  is  compared  with  RSFT  without  speculation  enabled.  We  also  evaluated  the  perfor¬ 
mance  overhead  of  dynamic  instrumentation  (PIN),  which  is  the  base  of  a  previous  related 
work  PLR  [49],  on  the  same  set  of  benchmarks.  This  is  measured  by  executing  the  original 
program  binary  within  the  PIN  framework  on  an  unloaded  machine,  without  any  instru¬ 
mentation.  The  average  overhead  of  the  framework  alone  is  as  high  as  48.57%.  Running 
redundant  instances  and  runtime  instrumentation  on  top  of  the  framework,  in  addition  to 
frequent  synchronization,  can  only  adds  more  runtime  overhead.  Compared  with  dynamic 
instrumentation  approaches,  such  as  PLR,  RSFT  provides  the  same  applicability,  but  better 
coverage  and  much  lower  runtime  overhead. 

RSFT-NoSpec  is  implemented  using  barrier  synchronizations  at  every  system  call,  the 
same  as  what  PLR  does  in  Figure  2.4,  but  without  the  dynamic  binary  instrumentation 
overhead  in  PLR.  This  implementation  does  not  speculate  any  program  state  after  a  system 
call  is  completed.  Any  program  instance  that  issues  a  system  call  first  stops  program  exe¬ 
cution,  waits  until  the  other  instance  issues  the  same  call,  then  starts  executing  the  system 
call.  Another  synchronization  is  required  upon  returning  from  the  system  call  execution  in 
kernel.  The  return  values  of  the  system  call  is  duplicated  in  both  program  instances  at  that 
point. 
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]  RSFT-Detection  without  Speculation 
]  RSFT-Detection  with  Speculation 
]  RSFT-Recovery  with  Speculation 


Figure  6.3:  Performance  overhead  for  RSFT-Detect  with  and  without  speculation,  and 
RSFT-Recover 

Some  benchmarks,  such  as  183.equake,  feature  system  calls  in  inner  loops.  This  results 
in  a  large  amount  of  inter-process  barrier  synchronization,  if  speculation  is  not  available. 
Frequent  synchronizations  prevent  overlapping  of  useful  computation  with  fault  checking, 
hence  the  synchronization  overhead  is  placed  on  the  critical  path.  As  a  result,  running 
RSFT  without  speculation  on  183.equake  adds  45.86%  overhead  compared  with  the  un¬ 
protected  sequential  program,  while  RSFT  (with  speculation)  added  only  16.60%  perfor¬ 
mance  overhead.  Similarly,  171. swim  and  433.milc  share  the  same  patterns,  and  gained 
huge  performance  improvement  from  speculation. 

In  this  evaluation,  RSFT  never  misspeculates  because  the  system  calls’  return  values 
are  all  correctly  speculated.  In  cases  where  misspeculation  occurs,  it  costs  merely  a  couple 
of  millisecond  for  the  program  to  recover  from  a  misspeculation. 

6.2.2  Physical  Memory  Usage 

As  a  technique  that  maintains  multiple  memory  states,  RSFT  adds  considerable  memory 
consumption  to  the  system.  A  program  and  its  redundant  copy  together  occupy  twice  as 
much  physical  memory  as  the  original  unprotected  program.  RSFT  itself,  however,  only 
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]  RSFT-Detection  with  Speculation 
]  RSFT-Recovery  with  Speculation 


Figure  6.4:  Physical  memory  overhead  for  RSFT  with  and  without  fault  recovery. 

maintains  several  static  objects  and  dynamically  allocates  (and  frees)  some  memory  space 
to  buffer  data  system  call  arguments  for  transient  fault  detection.  It  consumes  around 
9.5MB  virtual  memory  at  peak,  since  it  is  merely  a  thin  layer  between  the  applications 
and  the  underlying  systems. 

Figure  6.5  demonstrates  the  physical  memory  consumption  measure  at  program  run¬ 
time.  The  memory  overhead  of  both  RSFT-NoSpec  and  RSFT  with  and  without  fault  re¬ 
covery  are  measured  and  compared  in  this  Figure  as  well. 

Previous  process-level  redundant  execution  approaches,  such  as  PLR,  use  triple-modular 
redundancy  for  fault  recovery.  As  a  result,  their  implementation  must  features  three  iden¬ 
tical  program  instances  in  addition  to  the  tracing  process,  adding  even  more  overhead  to 
memory  consumption.  Consequently,  the  implementation  of  RSFT  introduces  more  mem¬ 
ory  overhead  than  the  other  implementations.  Compared  with  fault  recovery  using  triple 
redundancy  and  triple  memory  usage,  RSFT  provides  a  method  to  recover  from  transient 
faults  fast,  while  only  doubling  the  physical  memory  usage. 

Because  RSFT’s  random  physical  memory  walking  can  be  enabled  and  disabled  at  run¬ 
time,  RSFT  can  use  less  physical  memory  at  the  cost  of  reduced  fault  coverage  against 
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memory  transient  faults. 

Figure  6.5  indicates  that  the  tracer  program  itself  does  introduce  more  physical  memory 
usage  to  the  system.  Some  benchmark  programs  have  smaller  memory  footprint  by  itself. 
For  example,  179. art  only  uses  ttt  physical  memory  pages  at  its  peak.  Because  RSFT’s 
tracing  process  itself  uses  mmm  memory  pages  at  runtime  to  keep  its  own  data  structure 
and  stack,  RSFT  adds  more  memory  pressure  on  programs  like  179. art.  Note  that  the 
software  programs  are  becoming  larger  and  more  complex  than  ever  with  the  evolve  of 
modern  architecture,  the  memory  footprint  of  programs  is  also  becoming  bigger  than  ever. 
An  example  is  that  all  SPEC  2006  benchmark  programs  consume  much  more  memory  than 
SPEC  2000  programs.  Since  the  memory  consumption  of  RSFT  itself  remains  relatively 
constant,  the  memory  pressure  that  RSFT  adds  to  program  execution  will  also  become 
relatively  smaller. 

Although  the  memory  consumption  overhead  seems  to  be  high  compared  with  the  orig¬ 
inal  memory  usage,  the  real  extra  memory  consumption,  measured  in  meagbytes,  is  not 
significant.  As  the  main  memory  is  becoming  an  abundant  resource  in  modern  architec¬ 
ture,  the  extra  memory  used  by  RSFT  is  usually  not  a  concern.  Notice  that  using  extra 
memory  may  introduce  more  swapping  between  main  memory  and  the  disk.  This  overhead 
is  included  in  performance  measured  in  Figure  6.3. 

6.2.3  Power  Consumption 

The  power  consumption  of  RSFT  with  and  without  fault  recovery  is  also  evaluated  in  this 
dissertation  for  thorough  understanding  of  the  overhead  of  this  technique.  The  computation 
in  the  redundant  process  and  computation  and  the  tracer  process  uses  extra  CPU  cycles, 
hence  increasing  power  usage  as  well. 

However,  the  evaluation  shows  that  the  power  consumption  of  RSFT  protected  pro- 


71 


]  RSFT-Detection  with  Speculation 
]  RSFT-Recovery  with  Speculation 


Figure  6.5:  Physical  memory  overhead  for  RSFT  with  and  without  fault  recovery. 

I  i  RSFT-Detection  with  Speculation 

i  i  RSFT-Recovery  with  Speculation 


Figure  6.6:  Power  consumption  overhead  for  RSFT  with  and  without  fault  recovery. 

grams  is  no  more  than  the  original  program,  using  the  evaluation  method  presented  the 
Chapter  5.  Figure  6.6  shows  the  power  consumption  of  RSFT  with  and  without  fault  re¬ 
covery,  normalized  to  the  unprotected  original  program. 

The  experiments  show  that  the  power  consumption  at  any  given  sampling  moment  re¬ 
mains  the  same  for  the  original  program,  and  the  duplication  execution  protected  by  RSFT. 
As  a  result,  the  power  consumption  overhead  is  proportional  to  the  runtime  overhead  of 
RSFT  with  and  without  recovery. 
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6.2.4  Checkpointing  Overhead 

The  checkpointing  method  in  RSFT  involves  very  low  runtime  overhead.  RSFT  clones 
child  processes  from  both  worker  processes  and  verifies  if  they  are  identical  using  a  back¬ 
ground  checker.  The  original  two  worker  processes  also  have  to  pay  the  cost  of  memory 
page  copy-on- write.  If  any  physical  page  in  memory  is  written,  the  original  worker  pro¬ 
cesses  will  have  to  create  another  physical  page  in  memory,  potentially  misses  the  cache. 

The  overhead  of  checkpointing  is  negligible  if  it  is  performed  infrequently.  However,  in 
cases  where  frequent  checkpointing  is  needed,  the  performance  overhead  can  be  higher  than 
desired.  Figure  6.7  shows  how  much  performance  overhead  changes  with  the  frequency  of 
checkpointing. 

In  the  performance  evaluation  above,  RSFT  performs  checkpointing  either  every  5000 
system  calls,  or  once  per  run,  whichever  is  greater. 

6.2.5  Transient  Fault  Recovery  Overhead 

Another  metric  is  how  fast  RSFT  can  recover  a  program  from  a  transient  fault  caused  fail¬ 
ure.  By  using  a  process  duplication  and  memory  copy-on-write  checkpointing  method, 
transient  fault  recovery  only  involves  waking  up  the  checkpoint  processes  and  continue  ex¬ 
ecution  from  the  point  of  last  checkpoint.  Hence  recovery  process  is  very  fast  in  RSFT. 
Experiments  show  that  recovering  from  a  checkpoint  process  takes  merely  half  a  millisec¬ 
ond. 
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Figure  6.7:  Checkpointing  overhead  for  RSFT  with  and  without  fault  recovery. 
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Chapter  7 


Related  Work 


Chapter  2  introduced  the  most  related  transient  fault  tolerant  techniques.  This  section  dis¬ 
cusses  some  additional  related  work. 

DIVA  [63]  uses  additional  hardware  checkers  to  provide  fault  protection.  DieHard  [6] 
proposed  by  Berger  et  al.  uses  redundancy  on  general-purpose  machines  for  memory  fault 
tolerance.  Exterminator  [33]  uses  process  replicas  to  detect  memory  errors  with  high  prob¬ 
ability.  RSFT  also  detects  transient  faults  in  memory,  as  well  as  register  files. 

Tapus  et  al.  [53]  introduce  a  syntax  and  an  operational  semantics  for  speculative  exe¬ 
cution  for  reliability  and  fault  tolerance.  This  work  proves  that  the  speculative  execution 
model  is  equivalent  to  the  non-speculative  model.  Weaver  et  al.  [64]  and  Vijaykumar  et 
al.  [58]  use  program  behavior  to  direct  transient  fault  detection.  These  techniques  follow 
the  propagation  of  faults  through  the  program  to  reduce  unnecessary  replication.  RSFT 
achieves  the  same  goal  by  only  detecting  non-benign  faults  when  they  are  about  to  affect 
program  output.  Gaiswinkler  et  al.  [13]  use  the  compiler  to  generate  diverse  binaries  for  a 
program,  thus  detecting  a  majority  of  faults  that  affect  register  values.  However,  this  tech¬ 
nique  has  a  lower  fault  coverage  than  RSFT,  and  may  produce  false-positives.  Aidemark 
et  al.  [1]  propose  methods  to  detect  software  errors  by  executing  an  application  multiple 
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times  and  majority  voting. 

Lee  et  al.  propose  Respec,  an  online  multiprocessor  replay  technique  using  specula¬ 
tive  logging  for  externally  deterministic  replay  [24].  This  technique  optimistically  logs 
less  information  about  shared  memory  dependencies  than  needed  for  deterministic  replay. 
If  the  replayed  process  diverges  from  recorded  process,  mis  speculation  recovery  is  per¬ 
formed.  This  paper  and  Respec  follow  the  same  idea  that  only  externally  visible  behavior 
and  final  state  of  the  program  must  be  correct.  Respec  was  introduced  to  replay  externally 
visible  behavior  of  shared  memory  multithreaded  programs  on  commodity  multiprocessor 
architecture.  Similarly,  RSFT  can  also  be  extended  to  provide  transient  fault  detection  for 
multithreaded  programs  with  deterministic  externally  visible  behavior. 

Daniel  et  al.  [11]  explored  the  possibility  of  using  process  monitoring  utilities  of  Unix 
systems  for  fault  tolerance  on  cluster  platforms  with  NFS.  Process  monitoring  utilities 
are  also  exploited  for  related  research  topics,  such  as  fault  injection  and  program  secu¬ 
rity  checking  [50,  21,  10,  20,  62].  Jarboui  et  al.  present  a  software-implemented  fault 
injection  technique  to  identify  different  techniques  for  generating  different  software  faults 
for  operating  systems  [20].  In  this  work,  they  use  the  UNIX  ptrace  function  to  trap  ker¬ 
nel  calls  issued  by  the  process  where  the  faults  are  being  injected.  Sieh  implements  a  fault 
injector  using  the  UNIX  ptrace  process  monitoring  interface  that  can  inject  transient 
faults  into  most  of  the  CPU  registers,  FPU  and  FPA  registers,  and  into  the  virtual  address 
space  of  the  running  process  [50].  FERRARI  [21]  uses  the  UNIX  ptrace  function  to 
corrupt  the  memory  image  of  a  process  at  runtime.  The  ptrace  function  is  used  to  insert 
software  trap  instructions  at  the  specific  instruction  address  where  a  fault  is  to  be  injected. 
FERRARI  can  inject  faults  into  the  data  and  code  segments  of  a  running  process  as  well 
as  the  registers  and  part  of  the  main  memory  used  by  that  process.  It  can  also  intercept  the 
system  calls  made  by  the  running  process,  and  change  their  return  values.  Buchacker  et 
al.  develop  a  framework  for  testing  the  fault-tolerance  of  systems,  where  they  inject  faults 
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into  a  simulated  system  of  Linux  machines  using  the  ptrace  interface  [10].  This  paper 
uses  similar  methods  to  inject  faults  into  register  files  and  main  memory  to  study  the  fault 
coverage  of  RSFT. 
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Chapter  8 


Conclusion  and  Future  Work 


This  dissertation  presents  a  software-only  speculative  transient  fault  tolerance  system  called 
RSFT.  This  system  advances  the  state-of-the-art  transient  fault  tolerance  techniques  by 
achieving  lowest  runtime  overhead  and  highest  fault  coverage.  This  chapter  summarizes 
the  work  presented  in  previous  chapters  and  concludes.  Section  8.2  discusses  potential 
future  research  on  this  topic. 

8.1  Conclusions 

Architectural  trends  toward  smaller  transistors,  higher  transistor  counts,  and  lower  core 
voltage  make  transient  faults  a  more  critical  reliability  concern  than  ever.  Redundant  hard¬ 
ware  provides  transient  fault  detection  at  the  cost  of  additional  chip  area  and  design  cost. 
Software  redundancy  is  more  appealing  for  its  flexibility  and  low  cost.  However,  even 
the  best  available  software  techniques  for  transient  fault  detection  have  large  performance 
overhead. 

This  dissertation  presents  Runtime  Speculative  Software-only  Fault  Tolerance(RSFT), 
the  fastest  software  fault  detection  technique  to  date.  Combining  OS-level  process  monitor- 
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ing,  value  speculation,  and  efficient  misspeculation  recovery,  RSFT  provides  transient  fault 
tolerance  with  only  6.17%  runtime  overhead.  With  the  flexibility  of  software  and  the  low 
runtime  overhead,  RSFT  provides  a  practical  transient  fault  protection  scheme  for  modern 
multicore  processors. 

This  dissertation  introduces  a  complete  system  for  transient  fault  tolerance,  including 
speculative  fault  detection,  runtime  lightweight  checkpointing  and  background  memory 
scrubbing  for  fault  recovery.  This  system  provides  both  the  ease  of  use  and  high  perfor¬ 
mance  as  well  as  full  fault  coverage.  This  dissertation  also  study  and  evaluate  the  effect  of 
transient  fault  hitting  a  program’s  physical  memory  space. 

8.2  Future  Directions 

Potential  future  research  directions  based  on  the  contribution  of  this  dissertation  include 
configurable  fault  protection  and  runtime  performance  tuning.  The  software-only  specula¬ 
tive  fault  tolerance  techniques  proposed  in  this  dissertation  provides  full  fault  protection  at 
the  application  level.  For  large  software  systems  that  do  not  require  high  reliability  for  all 
its  components,  it  may  be  beneficial  to  protect  only  the  critical  code  regions. 

For  example,  the  numerical  computation  of  a  bank’s  online  banking  system  is  very  sen¬ 
sitive  to  transient  faults  and  must  be  fault  tolerant.  However,  the  graphic  interface  itself  of 
an  online  banking  system  is  not.  Missing  one  pixel  or  changing  the  color  of  one  pixel  on 
the  screen  is  not  critical  and  seldom  noticed,  as  long  as  the  fault  does  not  crash  the  sys¬ 
tem.  Different  programs  may  need  different  level  of  protection,  with  different  granularity 
as  well.  Hardware  fault  tolerant  solutions  will  not  be  able  to  adjust  its  policy  depending 
on  which  program  is  currently  scheduled  to  run  on  a  particular  processor.  Software  solu¬ 
tions,  especially  runtime  techniques  like  RSFT  is  able  to  adjust  to  different  fault  tolerance 
requirements  after  the  system  is  deployed.  This  can  be  realized  via  either  programmer  spec- 
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ification,  or  runtime  notification.  The  tracer  process  can  only  start  fault  tolerant  execution 
once  an  environmental  variable  is  set  or  a  signal  is  received. 

Another  natural  extension  of  this  dissertation  is  applying  the  techniques  described  to 
distributed  applications  with  non-deterministic  outputs.  This  requires  redefining  faulty  be¬ 
havior  of  a  program,  as  a  program’s  redundant  copy  can  behave  differently  from  itself. 
One  example  is  that  a  network  application  may  send  out  heartbeat  packets  to  the  outside 
world.  When  a  misspeculation  occurs  in  one  of  the  processes  in  RSFT,  we  have  to  know 
which  system  calls  can  be  re-executed.  This  may  require  user  level  interference,  such  as 
pre-defining  program  behaviorin  a  way  that  RSFT  can  understand. 

In  the  future,  the  methods  use  to  evaluate  and  predict  the  reliability  of  a  processor  will 
be  greatly  appreciated.  Recent  research  has  attempted  blah  blah.  These  techniques  can  be 
used  to  dynamically  determine  the  fault  rate  at  runtime  to  adaptive  configure  checkpointing 
frequency  as  a  response  to  the  environmental  changes. 

Formal  model  checking  methods  have  been  proposed  to  verify  transient  fault  tolerance 
techniques  [37,  27,  23].  Proposals  using  symbolic  execution  or  theorem  proving  have  been 
applied  to  several  existing  fault  tolerance  techniques  such  as  SWIFT.  However,  verification 
for  binary  level  transient  fault  tolerance  techniques  still  remains  an  open  problem.  Pattabi- 
raman  et  al.  [36]  proposed  SimPLIFIED,  a  symbolic  fault  injection  framework  for  assembly 
level  programs.  However,  this  approach  has  high  constraints.  For  example,  SimPLIFIED 
only  works  for  programs  without  floating  point  arithmetic,  and  is  only  applicable  to  pro¬ 
grams  with  fault  detection  code  embedded  at  assembly  level.  To  verify  and  check  for  the 
correctness  of  RSFT,  verification  for  multi-process  programs  must  be  provided. 
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8.3  Closing  Remarks 


Despite  of  all  the  efforts  researchers  made  throughout  the  years,  transient  fault  tolerance 
still  remains  one  of  the  most  difficult  problems  to  solve.  Among  all  the  techniques  proposed 
so  far,  hardware  or  software,  most  of  them  are  far  from  being  deployed  in  real  commodity 
systems.  The  complexity  and  high  cost  of  hardware  approaches  prevent  their  wide  adop¬ 
tion.  Existing  software  approaches,  limited  by  either  high  performance  penalty  or  limited 
fault  coverage  or  limited  applicability,  still  remain  in  research  papers  and  are  far  from  be¬ 
coming  practical. 

This  dissertation  takes  a  step  towards  deploying  a  practical  software  method  for  tran¬ 
sient  fault  tolerance,  using  the  existing  commodity  hardware  systems.  Although  RSFT  has 
some  window  of  vulnerability,  but  given  its  low  overhead  and  high  fault  coverage,  it  has 
the  potential  of  being  applied  to  existing  systems  and  provide  transient  fault  protection  at 
runtime. 
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